From 8b01caee99fb07218908c0ac9be8c758878f33f9 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Fri, 7 Jun 2019 15:54:18 -0300
Subject: isdn: mISDN: remove a bogus reference to a non-existing doc

The mISDN driver was added by these commits:

	960366cf8dbb ("Add mISDN DSP")
	1b2b03f8e514 ("Add mISDN core files")
	04578dd330f1 ("Define AF_ISDN and PF_ISDN")
	e4ac9bc1f668 ("Add mISDN driver")

None of them added a Documentation/isdn/mISDN.cert file.

Also, whatever was supposed to be written there at that time
probably doesn't make sense nowadays, as I doubt isdn will
see any massive changes.

So, let's just get rid of the broken reference, in order to
shut up a warning produced by
./scripts/documentation-file-ref-check.

Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: Jonathan Corbet
---
 drivers/isdn/mISDN/dsp_core.c | 2 --
 1 file changed, 2 deletions(-)

(limited to 'drivers')

diff --git a/drivers/isdn/mISDN/dsp_core.c b/drivers/isdn/mISDN/dsp_core.c
index cd036e87335a..038e72a84b33 100644
--- a/drivers/isdn/mISDN/dsp_core.c
+++ b/drivers/isdn/mISDN/dsp_core.c
@@ -4,8 +4,6 @@
  * Karsten Keil (keil@isdn4linux.de)
  *
  * This file is (c) under GNU PUBLIC LICENSE
- * For changes and modifications please read
- * ../../../Documentation/isdn/mISDN.cert
  *
  * Thanks to Karsten Keil (great drivers)
  * Cologne Chip (great chips)
--
cgit v1.2.3

From cb1aaebea8d79860181559d7b5d482aea63db113 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Fri, 7 Jun 2019 15:54:32 -0300
Subject: docs: fix broken documentation links

Mostly due to the x86 and ACPI conversions, several documentation
links still point to the old files. Fix them.

Signed-off-by: Mauro Carvalho Chehab
Reviewed-by: Wolfram Sang
Reviewed-by: Sven Van Asbroeck
Reviewed-by: Bhupesh Sharma
Acked-by: Mark Brown
Signed-off-by: Jonathan Corbet
---
 Documentation/acpi/dsd/leds.txt | 2 +-
 Documentation/admin-guide/kernel-parameters.rst | 6 +++---
 Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++--------
 Documentation/admin-guide/ras.rst | 2 +-
 Documentation/devicetree/bindings/net/fsl-enetc.txt | 7 +++----
 .../devicetree/bindings/pci/amlogic,meson-pcie.txt | 2 +-
 .../bindings/regulator/qcom,rpmh-regulator.txt | 2 +-
 Documentation/devicetree/booting-without-of.txt | 2 +-
 Documentation/driver-api/gpio/board.rst | 2 +-
 Documentation/driver-api/gpio/consumer.rst | 2 +-
 Documentation/firmware-guide/acpi/enumeration.rst | 2 +-
 Documentation/firmware-guide/acpi/method-tracing.rst | 2 +-
 Documentation/i2c/instantiating-devices | 2 +-
 Documentation/sysctl/kernel.txt | 4 ++--
 Documentation/translations/zh_CN/process/4.Coding.rst | 2 +-
 Documentation/x86/x86_64/5level-paging.rst | 2 +-
 Documentation/x86/x86_64/boot-options.rst | 4 ++--
 Documentation/x86/x86_64/fake-numa-for-cpusets.rst | 2 +-
 MAINTAINERS | 4 ++--
 arch/arm/Kconfig | 2 +-
 arch/arm64/kernel/kexec_image.c | 2 +-
 arch/x86/Kconfig | 14 +++++++-------
 arch/x86/Kconfig.debug | 2 +-
 arch/x86/boot/header.S | 2 +-
 arch/x86/entry/entry_64.S | 2 +-
 arch/x86/include/asm/bootparam_utils.h | 2 +-
 arch/x86/include/asm/page_64_types.h | 2 +-
 arch/x86/include/asm/pgtable_64_types.h | 2 +-
 arch/x86/kernel/cpu/microcode/amd.c | 2 +-
 arch/x86/kernel/kexec-bzimage64.c | 2 +-
 arch/x86/kernel/pci-dma.c | 2 +-
 arch/x86/mm/tlb.c | 2 +-
 arch/x86/platform/pvh/enlighten.c | 2 +-
 drivers/acpi/Kconfig | 10 +++++-----
 drivers/net/ethernet/faraday/ftgmac100.c | 2 +-
 drivers/staging/fieldbus/Documentation/fieldbus_dev.txt | 4 ++--
 drivers/vhost/vhost.c | 2 +-
 include/acpi/acpi_drivers.h | 2 +-
 include/linux/fs_context.h | 2 +-
 include/linux/lsm_hooks.h | 2 +-
 mm/Kconfig | 2 +-
 security/Kconfig | 2 +-
tools/include/linux/err.h | 2 +- tools/objtool/Documentation/stack-validation.txt | 4 ++-- 44 files changed, 70 insertions(+), 71 deletions(-) (limited to 'drivers') diff --git a/Documentation/acpi/dsd/leds.txt b/Documentation/acpi/dsd/leds.txt index 81a63af42ed2..cc58b1a574c5 100644 --- a/Documentation/acpi/dsd/leds.txt +++ b/Documentation/acpi/dsd/leds.txt @@ -96,4 +96,4 @@ where , referenced 2019-02-21. -[7] Documentation/acpi/dsd/data-node-reference.txt +[7] Documentation/firmware-guide/acpi/dsd/data-node-references.rst diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index 0124980dca2d..8d3273e32eb1 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -167,7 +167,7 @@ parameter is applicable:: X86-32 X86-32, aka i386 architecture is enabled. X86-64 X86-64 architecture is enabled. More X86-64 boot options can be found in - Documentation/x86/x86_64/boot-options.txt . + Documentation/x86/x86_64/boot-options.rst. X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) X86_UV SGI UV support is enabled. XEN Xen support is enabled @@ -181,10 +181,10 @@ In addition, the following text indicates that the option:: Parameters denoted with BOOT are actually interpreted by the boot loader, and have no meaning to the kernel directly. Do not modify the syntax of boot loader parameters without extreme -need or coordination with . +need or coordination with . There are also arch-specific kernel-parameters not documented here. -See for example . +See for example . 
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that a trailing = on the name of any parameter states that that parameter will diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 79d043b8850d..1abd7e145357 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -53,7 +53,7 @@ ACPI_DEBUG_PRINT statements, e.g., ACPI_DEBUG_PRINT((ACPI_DB_INFO, ... The debug_level mask defaults to "info". See - Documentation/acpi/debug.txt for more information about + Documentation/firmware-guide/acpi/debug.rst for more information about debug layers and levels. Enable processor driver info messages: @@ -963,7 +963,7 @@ for details. nompx [X86] Disables Intel Memory Protection Extensions. - See Documentation/x86/intel_mpx.txt for more + See Documentation/x86/intel_mpx.rst for more information about the feature. nopku [X86] Disable Memory Protection Keys CPU feature found @@ -1189,7 +1189,7 @@ that is to be dynamically loaded by Linux. If there are multiple variables with the same name but with different vendor GUIDs, all of them will be loaded. See - Documentation/acpi/ssdt-overlays.txt for details. + Documentation/admin-guide/acpi/ssdt-overlays.rst for details. eisa_irq_edge= [PARISC,HW] @@ -2383,7 +2383,7 @@ mce [X86-32] Machine Check Exception - mce=option [X86-64] See Documentation/x86/x86_64/boot-options.txt + mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst md= [HW] RAID subsystems devices and level See Documentation/admin-guide/md.rst. @@ -2439,7 +2439,7 @@ set according to the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config option. - See Documentation/memory-hotplug.txt. + See Documentation/admin-guide/mm/memory-hotplug.rst. memmap=exactmap [KNL,X86] Enable setting of an exact E820 memory map, as specified by the user. 
@@ -2528,7 +2528,7 @@ mem_encrypt=on: Activate SME mem_encrypt=off: Do not activate SME - Refer to Documentation/x86/amd-memory-encryption.txt + Refer to Documentation/virtual/kvm/amd-memory-encryption.rst for details on when memory encryption can be activated. mem_sleep_default= [SUSPEND] Default system suspend mode: @@ -3529,7 +3529,7 @@ See Documentation/blockdev/paride.txt. pirq= [SMP,APIC] Manual mp-table setup - See Documentation/x86/i386/IO-APIC.txt. + See Documentation/x86/i386/IO-APIC.rst. plip= [PPT,NET] Parallel port network link Format: { parport | timid | 0 } @@ -5055,7 +5055,7 @@ Can be used multiple times for multiple devices. vga= [BOOT,X86-32] Select a particular video mode - See Documentation/x86/boot.txt and + See Documentation/x86/boot.rst and Documentation/svga.txt. Use vga=ask for menu. This is actually a boot loader parameter; the value is diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst index c7495e42e6f4..2b20f5f7380d 100644 --- a/Documentation/admin-guide/ras.rst +++ b/Documentation/admin-guide/ras.rst @@ -199,7 +199,7 @@ Architecture (MCA)\ [#f3]_. mode). .. [#f3] For more details about the Machine Check Architecture (MCA), - please read Documentation/x86/x86_64/machinecheck at the Kernel tree. + please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree. EDAC - Error Detection And Correction ************************************* diff --git a/Documentation/devicetree/bindings/net/fsl-enetc.txt b/Documentation/devicetree/bindings/net/fsl-enetc.txt index c812e25ae90f..25fc687419db 100644 --- a/Documentation/devicetree/bindings/net/fsl-enetc.txt +++ b/Documentation/devicetree/bindings/net/fsl-enetc.txt @@ -16,8 +16,8 @@ Required properties: In this case, the ENETC node should include a "mdio" sub-node that in turn should contain the "ethernet-phy" node describing the external phy. 
Below properties are required, their bindings -already defined in ethernet.txt or phy.txt, under -Documentation/devicetree/bindings/net/*. +already defined in Documentation/devicetree/bindings/net/ethernet.txt or +Documentation/devicetree/bindings/net/phy.txt. Required: @@ -51,8 +51,7 @@ Example: connection: In this case, the ENETC port node defines a fixed link connection, -as specified by "fixed-link.txt", under -Documentation/devicetree/bindings/net/*. +as specified by Documentation/devicetree/bindings/net/fixed-link.txt. Required: diff --git a/Documentation/devicetree/bindings/pci/amlogic,meson-pcie.txt b/Documentation/devicetree/bindings/pci/amlogic,meson-pcie.txt index 12b18f82d441..efa2c8b9b85a 100644 --- a/Documentation/devicetree/bindings/pci/amlogic,meson-pcie.txt +++ b/Documentation/devicetree/bindings/pci/amlogic,meson-pcie.txt @@ -3,7 +3,7 @@ Amlogic Meson AXG DWC PCIE SoC controller Amlogic Meson PCIe host controller is based on the Synopsys DesignWare PCI core. It shares common functions with the PCIe DesignWare core driver and inherits common properties defined in -Documentation/devicetree/bindings/pci/designware-pci.txt. +Documentation/devicetree/bindings/pci/designware-pcie.txt. Additional properties are described here: diff --git a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt index 7ef2dbe48e8a..14d2eee96b3d 100644 --- a/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt +++ b/Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.txt @@ -97,7 +97,7 @@ Second Level Nodes - Regulators sent for this regulator including those which are for a strictly lower power state. -Other properties defined in Documentation/devicetree/bindings/regulator.txt +Other properties defined in Documentation/devicetree/bindings/regulator/regulator.txt may also be used. 
regulator-initial-mode and regulator-allowed-modes may be specified for VRM regulators using mode values from include/dt-bindings/regulator/qcom,rpmh-regulator.h. regulator-allow-bypass diff --git a/Documentation/devicetree/booting-without-of.txt b/Documentation/devicetree/booting-without-of.txt index e86bd2f64117..60f8640f2b2f 100644 --- a/Documentation/devicetree/booting-without-of.txt +++ b/Documentation/devicetree/booting-without-of.txt @@ -277,7 +277,7 @@ it with special cases. the decompressor (the real mode entry point goes to the same 32bit entry point once it switched into protected mode). That entry point supports one calling convention which is documented in - Documentation/x86/boot.txt + Documentation/x86/boot.rst The physical pointer to the device-tree block (defined in chapter II) is passed via setup_data which requires at least boot protocol 2.09. The type filed is defined as diff --git a/Documentation/driver-api/gpio/board.rst b/Documentation/driver-api/gpio/board.rst index b37f3f7b8926..ce91518bf9f4 100644 --- a/Documentation/driver-api/gpio/board.rst +++ b/Documentation/driver-api/gpio/board.rst @@ -101,7 +101,7 @@ with the help of _DSD (Device Specific Data), introduced in ACPI 5.1:: } For more information about the ACPI GPIO bindings see -Documentation/acpi/gpio-properties.txt. +Documentation/firmware-guide/acpi/gpio-properties.rst. Platform Data ------------- diff --git a/Documentation/driver-api/gpio/consumer.rst b/Documentation/driver-api/gpio/consumer.rst index 5e4d8aa68913..fdecb6d711db 100644 --- a/Documentation/driver-api/gpio/consumer.rst +++ b/Documentation/driver-api/gpio/consumer.rst @@ -437,7 +437,7 @@ case, it will be handled by the GPIO subsystem automatically. However, if the _DSD is not present, the mappings between GpioIo()/GpioInt() resources and GPIO connection IDs need to be provided by device drivers. 
-For details refer to Documentation/acpi/gpio-properties.txt +For details refer to Documentation/firmware-guide/acpi/gpio-properties.rst Interacting With the Legacy GPIO Subsystem diff --git a/Documentation/firmware-guide/acpi/enumeration.rst b/Documentation/firmware-guide/acpi/enumeration.rst index 850be9696931..1252617b520f 100644 --- a/Documentation/firmware-guide/acpi/enumeration.rst +++ b/Documentation/firmware-guide/acpi/enumeration.rst @@ -339,7 +339,7 @@ a code like this:: There are also devm_* versions of these functions which release the descriptors once the device is released. -See Documentation/acpi/gpio-properties.txt for more information about the +See Documentation/firmware-guide/acpi/gpio-properties.rst for more information about the _DSD binding related to GPIOs. MFD devices diff --git a/Documentation/firmware-guide/acpi/method-tracing.rst b/Documentation/firmware-guide/acpi/method-tracing.rst index d0b077b73f5f..0aa7e2c5d32a 100644 --- a/Documentation/firmware-guide/acpi/method-tracing.rst +++ b/Documentation/firmware-guide/acpi/method-tracing.rst @@ -68,7 +68,7 @@ c. Filter out the debug layer/level matched logs when the specified Where: 0xXXXXXXXX/0xYYYYYYYY - Refer to Documentation/acpi/debug.txt for possible debug layer/level + Refer to Documentation/firmware-guide/acpi/debug.rst for possible debug layer/level masking values. \PPPP.AAAA.TTTT.HHHH Full path of a control method that can be found in the ACPI namespace. diff --git a/Documentation/i2c/instantiating-devices b/Documentation/i2c/instantiating-devices index 0d85ac1935b7..5a3e2f331e8c 100644 --- a/Documentation/i2c/instantiating-devices +++ b/Documentation/i2c/instantiating-devices @@ -85,7 +85,7 @@ Method 1c: Declare the I2C devices via ACPI ------------------------------------------- ACPI can also describe I2C devices. There is special documentation for this -which is currently located at Documentation/acpi/enumeration.txt. 
+which is currently located at Documentation/firmware-guide/acpi/enumeration.rst. Method 2: Instantiate the devices explicitly diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index f0c86fbb3b48..92f7f34b021a 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -155,7 +155,7 @@ is 0x15 and the full version number is 0x234, this file will contain the value 340 = 0x154. See the type_of_loader and ext_loader_type fields in -Documentation/x86/boot.txt for additional information. +Documentation/x86/boot.rst for additional information. ============================================================== @@ -167,7 +167,7 @@ The complete bootloader version number. In the example above, this file will contain the value 564 = 0x234. See the type_of_loader and ext_loader_ver fields in -Documentation/x86/boot.txt for additional information. +Documentation/x86/boot.rst for additional information. ============================================================== diff --git a/Documentation/translations/zh_CN/process/4.Coding.rst b/Documentation/translations/zh_CN/process/4.Coding.rst index 5301e9d55255..8bb777941394 100644 --- a/Documentation/translations/zh_CN/process/4.Coding.rst +++ b/Documentation/translations/zh_CN/process/4.Coding.rst @@ -241,7 +241,7 @@ scripts/coccinelle目录下已经打包了相当多的内核“语义补丁” 任何添加新用户空间界面的代码(包括新的sysfs或/proc文件)都应该包含该界面的 文档,该文档使用户空间开发人员能够知道他们在使用什么。请参阅 -Documentation/abi/readme,了解如何格式化此文档以及需要提供哪些信息。 +Documentation/ABI/README,了解如何格式化此文档以及需要提供哪些信息。 文件 :ref:`Documentation/admin-guide/kernel-parameters.rst ` 描述了内核的所有引导时间参数。任何添加新参数的补丁都应该向该文件添加适当的 diff --git a/Documentation/x86/x86_64/5level-paging.rst b/Documentation/x86/x86_64/5level-paging.rst index ab88a4514163..44856417e6a5 100644 --- a/Documentation/x86/x86_64/5level-paging.rst +++ b/Documentation/x86/x86_64/5level-paging.rst @@ -20,7 +20,7 @@ physical address space. This "ought to be enough for anybody" ©. QEMU 2.9 and later support 5-level paging. 
Virtual memory layout for 5-level paging is described in -Documentation/x86/x86_64/mm.txt +Documentation/x86/x86_64/mm.rst Enabling 5-level paging diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst index 2f69836b8445..6a4285a3c7a4 100644 --- a/Documentation/x86/x86_64/boot-options.rst +++ b/Documentation/x86/x86_64/boot-options.rst @@ -9,7 +9,7 @@ only the AMD64 specific ones are listed here. Machine check ============= -Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables. +Please see Documentation/x86/x86_64/machinecheck.rst for sysfs runtime tunables. mce=off Disable machine check @@ -89,7 +89,7 @@ APICs Don't use the local APIC (alias for i386 compatibility) pirq=... - See Documentation/x86/i386/IO-APIC.txt + See Documentation/x86/i386/IO-APIC.rst noapictimer Don't set up the APIC timer diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst index 74fbb78b3c67..04df57b9aa3f 100644 --- a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst @@ -18,7 +18,7 @@ For more information on the features of cpusets, see Documentation/cgroup-v1/cpusets.txt. There are a number of different configurations you can use for your needs. For more information on the numa=fake command line option and its various ways of -configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. +configuring fake nodes, see Documentation/x86/x86_64/boot-options.rst. For the purposes of this introduction, we'll assume a very primitive NUMA emulation setup of "numa=fake=4*512,". 
This will split our system memory into diff --git a/MAINTAINERS b/MAINTAINERS index 5cfbea4ce575..26e0369c1641 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3874,7 +3874,7 @@ F: Documentation/devicetree/bindings/hwmon/cirrus,lochnagar.txt F: Documentation/devicetree/bindings/pinctrl/cirrus,lochnagar.txt F: Documentation/devicetree/bindings/regulator/cirrus,lochnagar.txt F: Documentation/devicetree/bindings/sound/cirrus,lochnagar.txt -F: Documentation/hwmon/lochnagar +F: Documentation/hwmon/lochnagar.rst CISCO FCOE HBA DRIVER M: Satish Kharat @@ -11272,7 +11272,7 @@ NXP FXAS21002C DRIVER M: Rui Miguel Silva L: linux-iio@vger.kernel.org S: Maintained -F: Documentation/devicetree/bindings/iio/gyroscope/fxas21002c.txt +F: Documentation/devicetree/bindings/iio/gyroscope/nxp,fxas21002c.txt F: drivers/iio/gyro/fxas21002c_core.c F: drivers/iio/gyro/fxas21002c.h F: drivers/iio/gyro/fxas21002c_i2c.c diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 8869742a85df..0f220264cc23 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1263,7 +1263,7 @@ config SMP uniprocessor machines. On a uniprocessor machine, the kernel will run faster if you say N here. - See also , + See also , and the SMP-HOWTO available at . diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c index 07bf740bea91..31cc2f423aa8 100644 --- a/arch/arm64/kernel/kexec_image.c +++ b/arch/arm64/kernel/kexec_image.c @@ -53,7 +53,7 @@ static void *image_load(struct kimage *image, /* * We require a kernel with an unambiguous Image header. Per - * Documentation/booting.txt, this is the case when image_size + * Documentation/arm64/booting.txt, this is the case when image_size * is non-zero (practically speaking, since v3.17). */ h = (struct arm64_image_header *)kernel; diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index d87d53fcd261..9f1f7b47621c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -395,7 +395,7 @@ config SMP Y to "Enhanced Real Time Clock Support", below. 
The "Advanced Power Management" code will be disabled if you say Y here. - See also , + See also , and the SMP-HOWTO available at . @@ -1290,7 +1290,7 @@ config MICROCODE the Linux kernel. The preferred method to load microcode from a detached initrd is described - in Documentation/x86/microcode.txt. For that you need to enable + in Documentation/x86/microcode.rst. For that you need to enable CONFIG_BLK_DEV_INITRD in order for the loader to be able to scan the initrd for microcode blobs. @@ -1329,7 +1329,7 @@ config MICROCODE_OLD_INTERFACE It is inadequate because it runs too late to be able to properly load microcode on a machine and it needs special tools. Instead, you should've switched to the early loading method with the initrd or - builtin microcode by now: Documentation/x86/microcode.txt + builtin microcode by now: Documentation/x86/microcode.rst config X86_MSR tristate "/dev/cpu/*/msr - Model-specific register support" @@ -1478,7 +1478,7 @@ config X86_5LEVEL A kernel with the option enabled can be booted on machines that support 4- or 5-level paging. - See Documentation/x86/x86_64/5level-paging.txt for more + See Documentation/x86/x86_64/5level-paging.rst for more information. Say N if unsure. @@ -1626,7 +1626,7 @@ config ARCH_MEMORY_PROBE depends on X86_64 && MEMORY_HOTPLUG help This option enables a sysfs memory/probe interface for testing. - See Documentation/memory-hotplug.txt for more information. + See Documentation/admin-guide/mm/memory-hotplug.rst for more information. If you are unsure how to answer this question, answer N. config ARCH_PROC_KCORE_TEXT @@ -1783,7 +1783,7 @@ config MTRR You can safely say Y even if your machine doesn't have MTRRs, you'll just add about 9 KB to your kernel. - See for more information. + See for more information. config MTRR_SANITIZER def_bool y @@ -1895,7 +1895,7 @@ config X86_INTEL_MPX process and adds some branches to paths used during exec() and munmap(). 
- For details, see Documentation/x86/intel_mpx.txt + For details, see Documentation/x86/intel_mpx.rst If unsure, say N. diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug index f730680dc818..59f598543203 100644 --- a/arch/x86/Kconfig.debug +++ b/arch/x86/Kconfig.debug @@ -156,7 +156,7 @@ config IOMMU_DEBUG code. When you use it make sure you have a big enough IOMMU/AGP aperture. Most of the options enabled by this can be set more finegrained using the iommu= command line - options. See Documentation/x86/x86_64/boot-options.txt for more + options. See Documentation/x86/x86_64/boot-options.rst for more details. config IOMMU_LEAK diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S index 850b8762e889..90d791ca1a95 100644 --- a/arch/x86/boot/header.S +++ b/arch/x86/boot/header.S @@ -313,7 +313,7 @@ start_sys_seg: .word SYSSEG # obsolete and meaningless, but just type_of_loader: .byte 0 # 0 means ancient bootloader, newer # bootloaders know to change this. - # See Documentation/x86/boot.txt for + # See Documentation/x86/boot.rst for # assigned ids # flags, unused bits must be zero (RFU) bit within loadflags diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S index 11aa3b2afa4d..33f9fc38d014 100644 --- a/arch/x86/entry/entry_64.S +++ b/arch/x86/entry/entry_64.S @@ -8,7 +8,7 @@ * * entry.S contains the system-call and fault low-level handling routines. 
* - * Some of this is documented in Documentation/x86/entry_64.txt + * Some of this is documented in Documentation/x86/entry_64.rst * * A note on terminology: * - iret frame: Architecture defined interrupt frame from SS to RIP diff --git a/arch/x86/include/asm/bootparam_utils.h b/arch/x86/include/asm/bootparam_utils.h index f6f6ef436599..101eb944f13c 100644 --- a/arch/x86/include/asm/bootparam_utils.h +++ b/arch/x86/include/asm/bootparam_utils.h @@ -24,7 +24,7 @@ static void sanitize_boot_params(struct boot_params *boot_params) * IMPORTANT NOTE TO BOOTLOADER AUTHORS: do not simply clear * this field. The purpose of this field is to guarantee * compliance with the x86 boot spec located in - * Documentation/x86/boot.txt . That spec says that the + * Documentation/x86/boot.rst . That spec says that the * *whole* structure should be cleared, after which only the * portion defined by struct setup_header (boot_params->hdr) * should be copied in. diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 793c14c372cb..288b065955b7 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -48,7 +48,7 @@ #define __START_KERNEL_map _AC(0xffffffff80000000, UL) -/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */ +/* See Documentation/x86/x86_64/mm.rst for a description of the memory map. */ #define __PHYSICAL_MASK_SHIFT 52 diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h index 88bca456da99..52e5f5f2240d 100644 --- a/arch/x86/include/asm/pgtable_64_types.h +++ b/arch/x86/include/asm/pgtable_64_types.h @@ -103,7 +103,7 @@ extern unsigned int ptrs_per_p4d; #define PGDIR_MASK (~(PGDIR_SIZE - 1)) /* - * See Documentation/x86/x86_64/mm.txt for a description of the memory map. + * See Documentation/x86/x86_64/mm.rst for a description of the memory map. * * Be very careful vs. KASLR when changing anything here. 
The KASLR address * range must not overlap with anything except the KASAN shadow area, which diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c index e1f3ba19ba54..06d4e67f31ab 100644 --- a/arch/x86/kernel/cpu/microcode/amd.c +++ b/arch/x86/kernel/cpu/microcode/amd.c @@ -61,7 +61,7 @@ static u8 amd_ucode_patch[PATCH_MAX_SIZE]; /* * Microcode patch container file is prepended to the initrd in cpio - * format. See Documentation/x86/microcode.txt + * format. See Documentation/x86/microcode.rst */ static const char ucode_path[] __maybe_unused = "kernel/x86/microcode/AuthenticAMD.bin"; diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c index 22f60dd26460..b07e7069b09e 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -416,7 +416,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel, efi_map_offset = params_cmdline_sz; efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16); - /* Copy setup header onto bootparams. Documentation/x86/boot.txt */ + /* Copy setup header onto bootparams. Documentation/x86/boot.rst */ setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset; /* Is there a limit on setup header size? */ diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index dcd272dbd0a9..f62b498b18fb 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -70,7 +70,7 @@ void __init pci_iommu_alloc(void) } /* - * See for the iommu kernel + * See for the iommu kernel * parameter documentation. */ static __init int iommu_setup(char *p) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 7f61431c75fb..400c1ba033aa 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -711,7 +711,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask, } /* - * See Documentation/x86/tlb.txt for details. We choose 33 + * See Documentation/x86/tlb.rst for details. 
We choose 33 * because it is large enough to cover the vast majority (at * least 95%) of allocations, and is small enough that we are * confident it will not cause too much overhead. Each single diff --git a/arch/x86/platform/pvh/enlighten.c b/arch/x86/platform/pvh/enlighten.c index 1861a2ba0f2b..c0a502f7e3a7 100644 --- a/arch/x86/platform/pvh/enlighten.c +++ b/arch/x86/platform/pvh/enlighten.c @@ -86,7 +86,7 @@ static void __init init_pvh_bootparams(bool xen_guest) } /* - * See Documentation/x86/boot.txt. + * See Documentation/x86/boot.rst. * * Version 2.12 supports Xen entry point but we will use default x86/PC * environment (i.e. hardware_subarch 0). diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 283ee94224c6..2438f37f2ca1 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -333,7 +333,7 @@ config ACPI_CUSTOM_DSDT_FILE depends on !STANDALONE help This option supports a custom DSDT by linking it into the kernel. - See Documentation/acpi/dsdt-override.txt + See Documentation/admin-guide/acpi/dsdt-override.rst Enter the full path name to the file which includes the AmlCode or dsdt_aml_code declaration. @@ -355,7 +355,7 @@ config ACPI_TABLE_UPGRADE This option provides functionality to upgrade arbitrary ACPI tables via initrd. No functional change if no ACPI tables are passed via initrd, therefore it's safe to say Y. - See Documentation/acpi/initrd_table_override.txt for details + See Documentation/admin-guide/acpi/initrd_table_override.rst for details config ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD bool "Override ACPI tables from built-in initrd" @@ -365,7 +365,7 @@ config ACPI_TABLE_OVERRIDE_VIA_BUILTIN_INITRD This option provides functionality to override arbitrary ACPI tables from built-in uncompressed initrd. 
- See Documentation/acpi/initrd_table_override.txt for details + See Documentation/admin-guide/acpi/initrd_table_override.rst for details config ACPI_DEBUG bool "Debug Statements" @@ -374,7 +374,7 @@ config ACPI_DEBUG output and increases the kernel size by around 50K. Use the acpi.debug_layer and acpi.debug_level kernel command-line - parameters documented in Documentation/acpi/debug.txt and + parameters documented in Documentation/firmware-guide/acpi/debug.rst and Documentation/admin-guide/kernel-parameters.rst to control the type and amount of debug output. @@ -445,7 +445,7 @@ config ACPI_CUSTOM_METHOD help This debug facility allows ACPI AML methods to be inserted and/or replaced without rebooting the system. For details refer to: - Documentation/acpi/method-customizing.txt. + Documentation/firmware-guide/acpi/method-customizing.rst. NOTE: This option is security sensitive, because it allows arbitrary kernel memory to be written to by root (uid=0) users, allowing them diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c index b17b79e612a3..ac6280ad43a1 100644 --- a/drivers/net/ethernet/faraday/ftgmac100.c +++ b/drivers/net/ethernet/faraday/ftgmac100.c @@ -1075,7 +1075,7 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv, phy_interface_t intf) } /* Indicate that we support PAUSE frames (see comment in - * Documentation/networking/phy.txt) + * Documentation/networking/phy.rst) */ phy_support_asym_pause(phydev); diff --git a/drivers/staging/fieldbus/Documentation/fieldbus_dev.txt b/drivers/staging/fieldbus/Documentation/fieldbus_dev.txt index 56af3f650fa3..89fb8e14676f 100644 --- a/drivers/staging/fieldbus/Documentation/fieldbus_dev.txt +++ b/drivers/staging/fieldbus/Documentation/fieldbus_dev.txt @@ -54,8 +54,8 @@ a limited few common behaviours and properties. 
This allows us to define a simple interface consisting of a character device and a set of sysfs files: See: -Documentation/ABI/testing/sysfs-class-fieldbus-dev -Documentation/ABI/testing/fieldbus-dev-cdev +drivers/staging/fieldbus/Documentation/ABI/sysfs-class-fieldbus-dev +drivers/staging/fieldbus/Documentation/ABI/fieldbus-dev-cdev Note that this simple interface does not provide a way to modify adapter configuration settings. It is therefore useful only for adapters that get their diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 1e3ed41ae1f3..69938dbae2d0 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1694,7 +1694,7 @@ EXPORT_SYMBOL_GPL(vhost_dev_ioctl); /* TODO: This is really inefficient. We need something like get_user() * (instruction directly accesses the data, with an exception table entry - * returning -EFAULT). See Documentation/x86/exception-tables.txt. + * returning -EFAULT). See Documentation/x86/exception-tables.rst. */ static int set_bit_to_user(int nr, void __user *addr) { diff --git a/include/acpi/acpi_drivers.h b/include/acpi/acpi_drivers.h index de1804aeaf69..98e3db7a89cd 100644 --- a/include/acpi/acpi_drivers.h +++ b/include/acpi/acpi_drivers.h @@ -25,7 +25,7 @@ #define ACPI_MAX_STRING 80 /* - * Please update drivers/acpi/debug.c and Documentation/acpi/debug.txt + * Please update drivers/acpi/debug.c and Documentation/firmware-guide/acpi/debug.rst * if you add to this list. */ #define ACPI_BUS_COMPONENT 0x00010000 diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h index 1f966670c8dc..623eb58560b9 100644 --- a/include/linux/fs_context.h +++ b/include/linux/fs_context.h @@ -85,7 +85,7 @@ struct fs_parameter { * Superblock creation fills in ->root whereas reconfiguration begins with this * already set. 
* - * See Documentation/filesystems/mounting.txt + * See Documentation/filesystems/mount_api.txt */ struct fs_context { const struct fs_context_operations *ops; diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 47f58cfb6a19..df1318d85f7d 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -77,7 +77,7 @@ * state. This is called immediately after commit_creds(). * * Security hooks for mount using fs_context. - * [See also Documentation/filesystems/mounting.txt] + * [See also Documentation/filesystems/mount_api.txt] * * @fs_context_dup: * Allocate and attach a security structure to sc->security. This pointer diff --git a/mm/Kconfig b/mm/Kconfig index ee8d1f311858..6e5fb81bde4b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -165,7 +165,7 @@ config MEMORY_HOTPLUG_DEFAULT_ONLINE onlining policy (/sys/devices/system/memory/auto_online_blocks) which determines what happens to newly added memory regions. Policy setting can always be changed at runtime. - See Documentation/memory-hotplug.txt for more information. + See Documentation/admin-guide/mm/memory-hotplug.rst for more information. Say Y here if you want all hot-plugged memory blocks to appear in 'online' state by default. diff --git a/security/Kconfig b/security/Kconfig index aeac3676dd4d..6d75ed71970c 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -62,7 +62,7 @@ config PAGE_TABLE_ISOLATION ensuring that the majority of kernel addresses are not mapped into userspace. - See Documentation/x86/pti.txt for more details. + See Documentation/x86/pti.rst for more details. 
config SECURITY_INFINIBAND bool "Infiniband Security Hooks" diff --git a/tools/include/linux/err.h b/tools/include/linux/err.h index 2f5a12b88a86..25f2bb3a991d 100644 --- a/tools/include/linux/err.h +++ b/tools/include/linux/err.h @@ -20,7 +20,7 @@ * Userspace note: * The same principle works for userspace, because 'error' pointers * fall down to the unused hole far from user space, as described - * in Documentation/x86/x86_64/mm.txt for x86_64 arch: + * in Documentation/x86/x86_64/mm.rst for x86_64 arch: * * 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm hole caused by [48:63] sign extension * ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole diff --git a/tools/objtool/Documentation/stack-validation.txt b/tools/objtool/Documentation/stack-validation.txt index 4dd11a554b9b..de094670050b 100644 --- a/tools/objtool/Documentation/stack-validation.txt +++ b/tools/objtool/Documentation/stack-validation.txt @@ -21,7 +21,7 @@ instructions). Similarly, it knows how to follow switch statements, for which gcc sometimes uses jump tables. (Objtool also has an 'orc generate' subcommand which generates debuginfo -for the ORC unwinder. See Documentation/x86/orc-unwinder.txt in the +for the ORC unwinder. See Documentation/x86/orc-unwinder.rst in the kernel tree for more details.) @@ -101,7 +101,7 @@ b) ORC (Oops Rewind Capability) unwind table generation band. So it doesn't affect runtime performance and it can be reliable even when interrupts or exceptions are involved. - For more details, see Documentation/x86/orc-unwinder.txt. + For more details, see Documentation/x86/orc-unwinder.rst. c) Higher live patching compatibility rate -- cgit v1.2.3 From b640fbad2d8fe120c761f61eb6c96f05047100cd Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 7 Jun 2019 15:54:36 -0300 Subject: docs: pci: fix broken links due to conversion from pci.txt to pci.rst Some documentation files were still pointing to the old place. 
Fixes: 229b4e0728e0 ("Documentation: PCI: convert pci.txt to reST") Signed-off-by: Mauro Carvalho Chehab Acked-by: Paul E. McKenney Signed-off-by: Jonathan Corbet --- Documentation/memory-barriers.txt | 2 +- Documentation/translations/ko_KR/memory-barriers.txt | 2 +- drivers/scsi/hpsa.c | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) (limited to 'drivers') diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index f70ebcdfe592..f4170aae1d75 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -548,7 +548,7 @@ There are certain things that the Linux kernel memory barriers do not guarantee: [*] For information on bus mastering DMA and coherency please read: - Documentation/PCI/pci.txt + Documentation/PCI/pci.rst Documentation/DMA-API-HOWTO.txt Documentation/DMA-API.txt diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index db0b9d8619f1..07725b1df002 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -569,7 +569,7 @@ ACQUIRE 는 해당 오퍼레이션의 로드 부분에만 적용되고 RELEASE [*] 버스 마스터링 DMA 와 일관성에 대해서는 다음을 참고하시기 바랍니다: - Documentation/PCI/pci.txt + Documentation/PCI/pci.rst Documentation/DMA-API-HOWTO.txt Documentation/DMA-API.txt diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c index 1bef1da273c2..53df6f7dd3f9 100644 --- a/drivers/scsi/hpsa.c +++ b/drivers/scsi/hpsa.c @@ -7760,7 +7760,7 @@ static void hpsa_free_pci_init(struct ctlr_info *h) hpsa_disable_interrupt_mode(h); /* pci_init 2 */ /* * call pci_disable_device before pci_release_regions per - * Documentation/PCI/pci.txt + * Documentation/PCI/pci.rst */ pci_disable_device(h->pdev); /* pci_init 1 */ pci_release_regions(h->pdev); /* pci_init 2 */ @@ -7843,7 +7843,7 @@ clean2: /* intmode+region, pci */ clean1: /* * call pci_disable_device before pci_release_regions per - * Documentation/PCI/pci.txt + * 
Documentation/PCI/pci.rst */ pci_disable_device(h->pdev); pci_release_regions(h->pdev); -- cgit v1.2.3 From e327cfcb25422c91f4bb8e8a3488386ac95955f1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 12 Jun 2019 14:52:39 -0300 Subject: docs: cdrom-standard.tex: convert from LaTeX to ReST This is the only LaTeX documentation file inside the documentation. Instead of having a LaTeX document directly there, convert it to ReST format, as this is the format we're using for docs. For now, let's keep the extension as .txt in order to avoid warnings when building the documentation with Sphinx. The next patch will rename it to .rst and add it to the building system. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/cdrom/Makefile | 21 - Documentation/cdrom/cdrom-standard.tex | 1026 ------------------------------ Documentation/cdrom/cdrom-standard.txt | 1063 ++++++++++++++++++++++++++++++++ drivers/cdrom/cdrom.c | 2 +- 4 files changed, 1064 insertions(+), 1048 deletions(-) delete mode 100644 Documentation/cdrom/Makefile delete mode 100644 Documentation/cdrom/cdrom-standard.tex create mode 100644 Documentation/cdrom/cdrom-standard.txt (limited to 'drivers') diff --git a/Documentation/cdrom/Makefile b/Documentation/cdrom/Makefile deleted file mode 100644 index a19e321928e1..000000000000 --- a/Documentation/cdrom/Makefile +++ /dev/null @@ -1,21 +0,0 @@ -LATEXFILE = cdrom-standard - -all: - make clean - latex $(LATEXFILE) - latex $(LATEXFILE) - @if [ -x `which gv` ]; then \ - `dvips -q -t letter -o $(LATEXFILE).ps $(LATEXFILE).dvi` ;\ - `gv -antialias -media letter -nocenter $(LATEXFILE).ps` ;\ - else \ - `xdvi $(LATEXFILE).dvi &` ;\ - fi - make sortofclean - -clean: - rm -f $(LATEXFILE).ps $(LATEXFILE).dvi $(LATEXFILE).aux $(LATEXFILE).log - -sortofclean: - rm -f $(LATEXFILE).aux $(LATEXFILE).log - - diff --git a/Documentation/cdrom/cdrom-standard.tex b/Documentation/cdrom/cdrom-standard.tex deleted file mode 100644 index
f7cd455973f7..000000000000 --- a/Documentation/cdrom/cdrom-standard.tex +++ /dev/null @@ -1,1026 +0,0 @@ -\documentclass{article} -\def\version{$Id: cdrom-standard.tex,v 1.9 1997/12/28 15:42:49 david Exp $} -\newcommand{\newsection}[1]{\newpage\section{#1}} - -\evensidemargin=0pt -\oddsidemargin=0pt -\topmargin=-\headheight \advance\topmargin by -\headsep -\textwidth=15.99cm \textheight=24.62cm % normal A4, 1'' margin - -\def\linux{{\sc Linux}} -\def\cdrom{{\sc cd-rom}} -\def\UCD{{\sc Uniform cd-rom Driver}} -\def\cdromc{{\tt {cdrom.c}}} -\def\cdromh{{\tt {cdrom.h}}} -\def\fo{\sl} % foreign words -\def\ie{{\fo i.e.}} -\def\eg{{\fo e.g.}} - -\everymath{\it} \everydisplay{\it} -\catcode `\_=\active \def_{\_\penalty100 } -\catcode`\<=\active \def<#1>{{\langle\hbox{\rm#1}\rangle}} - -\begin{document} -\title{A \linux\ \cdrom\ standard} -\author{David van Leeuwen\\{\normalsize\tt david@ElseWare.cistron.nl} -\\{\footnotesize updated by Erik Andersen {\tt(andersee@debian.org)}} -\\{\footnotesize updated by Jens Axboe {\tt(axboe@image.dk)}}} -\date{12 March 1999} - -\maketitle - -\newsection{Introduction} - -\linux\ is probably the Unix-like operating system that supports -the widest variety of hardware devices. The reasons for this are -presumably -\begin{itemize} -\item - The large list of hardware devices available for the many platforms - that \linux\ now supports (\ie, i386-PCs, Sparc Suns, etc.) -\item - The open design of the operating system, such that anybody can write a - driver for \linux. -\item - There is plenty of source code around as examples of how to write a driver. -\end{itemize} -The openness of \linux, and the many different types of available -hardware has allowed \linux\ to support many different hardware devices. -Unfortunately, the very openness that has allowed \linux\ to support -all these different devices has also allowed the behavior of each -device driver to differ significantly from one device to another. 
-This divergence of behavior has been very significant for \cdrom\ -devices; the way a particular drive reacts to a `standard' $ioctl()$ -call varies greatly from one device driver to another. To avoid making -their drivers totally inconsistent, the writers of \linux\ \cdrom\ -drivers generally created new device drivers by understanding, copying, -and then changing an existing one. Unfortunately, this practice did not -maintain uniform behavior across all the \linux\ \cdrom\ drivers. - -This document describes an effort to establish Uniform behavior across -all the different \cdrom\ device drivers for \linux. This document also -defines the various $ioctl$s, and how the low-level \cdrom\ device -drivers should implement them. Currently (as of the \linux\ 2.1.$x$ -development kernels) several low-level \cdrom\ device drivers, including -both IDE/ATAPI and SCSI, now use this Uniform interface. - -When the \cdrom\ was developed, the interface between the \cdrom\ drive -and the computer was not specified in the standards. As a result, many -different \cdrom\ interfaces were developed. Some of them had their -own proprietary design (Sony, Mitsumi, Panasonic, Philips), other -manufacturers adopted an existing electrical interface and changed -the functionality (CreativeLabs/SoundBlaster, Teac, Funai) or simply -adapted their drives to one or more of the already existing electrical -interfaces (Aztech, Sanyo, Funai, Vertos, Longshine, Optics Storage and -most of the `NoName' manufacturers). In cases where a new drive really -brought its own interface or used its own command set and flow control -scheme, either a separate driver had to be written, or an existing -driver had to be enhanced. History has delivered us \cdrom\ support for -many of these different interfaces. Nowadays, almost all new \cdrom\ -drives are either IDE/ATAPI or SCSI, and it is very unlikely that any -manufacturer will create a new interface. 
Even finding drives for the -old proprietary interfaces is getting difficult. - -When (in the 1.3.70's) I looked at the existing software interface, -which was expressed through \cdromh, it appeared to be a rather wild -set of commands and data formats.\footnote{I cannot recollect what -kernel version I looked at, then, presumably 1.2.13 and 1.3.34---the -latest kernel that I was indirectly involved in.} It seemed that many -features of the software interface had been added to accommodate the -capabilities of a particular drive, in an {\fo ad hoc\/} manner. More -importantly, it appeared that the behavior of the `standard' commands -was different for most of the different drivers: \eg, some drivers -close the tray if an $open()$ call occurs when the tray is open, while -others do not. Some drivers lock the door upon opening the device, to -prevent an incoherent file system, but others don't, to allow software -ejection. Undoubtedly, the capabilities of the different drives vary, -but even when two drives have the same capability their drivers' -behavior was usually different. - -I decided to start a discussion on how to make all the \linux\ \cdrom\ -drivers behave more uniformly. I began by contacting the developers of -the many \cdrom\ drivers found in the \linux\ kernel. Their reactions -encouraged me to write the \UCD\ which this document is intended to -describe. The implementation of the \UCD\ is in the file \cdromc. This -driver is intended to be an additional software layer that sits on top -of the low-level device drivers for each \cdrom\ drive. By adding this -additional layer, it is possible to have all the different \cdrom\ -devices behave {\em exactly\/} the same (insofar as the underlying -hardware will allow). - -The goal of the \UCD\ is {\em not\/} to alienate driver developers who -have not yet taken steps to support this effort. 
The goal of \UCD\ is -simply to give people writing application programs for \cdrom\ drives -{\em one\/} \linux\ \cdrom\ interface with consistent behavior for all -\cdrom\ devices. In addition, this also provides a consistent interface -between the low-level device driver code and the \linux\ kernel. Care -is taken that 100\,\% compatibility exists with the data structures and -programmer's interface defined in \cdromh. This guide was written to -help \cdrom\ driver developers adapt their code to use the \UCD\ code -defined in \cdromc. - -Personally, I think that the most important hardware interfaces are -the IDE/ATAPI drives and, of course, the SCSI drives, but as prices -of hardware drop continuously, it is also likely that people may have -more than one \cdrom\ drive, possibly of mixed types. It is important -that these drives behave in the same way. In December 1994, one of the -cheapest \cdrom\ drives was a Philips cm206, a double-speed proprietary -drive. In the months that I was busy writing a \linux\ driver for it, -proprietary drives became obsolete and IDE/ATAPI drives became the -standard. At the time of the last update to this document (November -1997) it is becoming difficult to even {\em find} anything less than a -16 speed \cdrom\ drive, and 24 speed drives are common. - -\newsection{Standardizing through another software level} -\label{cdrom.c} - -At the time this document was conceived, all drivers directly -implemented the \cdrom\ $ioctl()$ calls through their own routines. This -led to the danger of different drivers forgetting to do important things -like checking that the user was giving the driver valid data. More -importantly, this led to the divergence of behavior, which has already -been discussed. - -For this reason, the \UCD\ was created to enforce consistent \cdrom\ -drive behavior, and to provide a common set of services to the various -low-level \cdrom\ device drivers. 
The \UCD\ now provides another -software-level, that separates the $ioctl()$ and $open()$ implementation -from the actual hardware implementation. Note that this effort has -made few changes which will affect a user's application programs. The -greatest change involved moving the contents of the various low-level -\cdrom\ drivers' header files to the kernel's cdrom directory. This was -done to help ensure that the user is only presented with only one cdrom -interface, the interface defined in \cdromh. - -\cdrom\ drives are specific enough (\ie, different from other -block-devices such as floppy or hard disc drives), to define a set -of common {\em \cdrom\ device operations}, $_dops$. -These operations are different from the classical block-device file -operations, $_fops$. - -The routines for the \UCD\ interface level are implemented in the file -\cdromc. In this file, the \UCD\ interfaces with the kernel as a block -device by registering the following general $struct\ file_operations$: -$$ -\halign{$#$\ \hfil&$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr -struct& file_operations\ cdrom_fops = \{\hidewidth\cr - &NULL, & lseek \cr - &block_read, & read---general block-dev read \cr - &block_write, & write---general block-dev write \cr - &NULL, & readdir \cr - &NULL, & select \cr - &cdrom_ioctl, & ioctl \cr - &NULL, & mmap \cr - &cdrom_open, & open \cr - &cdrom_release, & release \cr - &NULL, & fsync \cr - &NULL, & fasync \cr - &cdrom_media_changed, & media change \cr - &NULL & revalidate \cr -\};\cr -} -$$ - -Every active \cdrom\ device shares this $struct$. The routines -declared above are all implemented in \cdromc, since this file is the -place where the behavior of all \cdrom-devices is defined and -standardized. The actual interface to the various types of \cdrom\ -hardware is still performed by various low-level \cdrom-device -drivers. These routines simply implement certain {\em capabilities\/} -that are common to all \cdrom\ (and really, all removable-media -devices). 
- -Registration of a low-level \cdrom\ device driver is now done through -the general routines in \cdromc, not through the Virtual File System -(VFS) any more. The interface implemented in \cdromc\ is carried out -through two general structures that contain information about the -capabilities of the driver, and the specific drives on which the -driver operates. The structures are: -\begin{description} -\item[$cdrom_device_ops$] - This structure contains information about the low-level driver for a - \cdrom\ device. This structure is conceptually connected to the major - number of the device (although some drivers may have different - major numbers, as is the case for the IDE driver). -\item[$cdrom_device_info$] - This structure contains information about a particular \cdrom\ drive, - such as its device name, speed, etc. This structure is conceptually - connected to the minor number of the device. -\end{description} - -Registering a particular \cdrom\ drive with the \UCD\ is done by the -low-level device driver though a call to: -$$register_cdrom(struct\ cdrom_device_info * _info) -$$ -The device information structure, $_info$, contains all the -information needed for the kernel to interface with the low-level -\cdrom\ device driver. One of the most important entries in this -structure is a pointer to the $cdrom_device_ops$ structure of the -low-level driver. - -The device operations structure, $cdrom_device_ops$, contains a list -of pointers to the functions which are implemented in the low-level -device driver. When \cdromc\ accesses a \cdrom\ device, it does it -through the functions in this structure. It is impossible to know all -the capabilities of future \cdrom\ drives, so it is expected that this -list may need to be expanded from time to time as new technologies are -developed. For example, CD-R and CD-R/W drives are beginning to become -popular, and support will soon need to be added for them. 
For now, the -current $struct$ is: -$$ -\halign{$#$\ \hfil&$#$\ \hfil&\hbox to 10em{$#$\hss}& - $/*$ \rm# $*/$\hfil\cr -struct& cdrom_device_ops\ \{ \hidewidth\cr - &int& (* open)(struct\ cdrom_device_info *, int)\cr - &void& (* release)(struct\ cdrom_device_info *);\cr - &int& (* drive_status)(struct\ cdrom_device_info *, int);\cr - &unsigned\ int& (* check_events)(struct\ cdrom_device_info *, unsigned\ int, int);\cr - &int& (* media_changed)(struct\ cdrom_device_info *, int);\cr - &int& (* tray_move)(struct\ cdrom_device_info *, int);\cr - &int& (* lock_door)(struct\ cdrom_device_info *, int);\cr - &int& (* select_speed)(struct\ cdrom_device_info *, int);\cr - &int& (* select_disc)(struct\ cdrom_device_info *, int);\cr - &int& (* get_last_session) (struct\ cdrom_device_info *, - struct\ cdrom_multisession *{});\cr - &int& (* get_mcn)(struct\ cdrom_device_info *, struct\ cdrom_mcn *{});\cr - &int& (* reset)(struct\ cdrom_device_info *);\cr - &int& (* audio_ioctl)(struct\ cdrom_device_info *, unsigned\ int, - void *{});\cr -\noalign{\medskip} - &const\ int& capability;& capability flags \cr - &int& (* generic_packet)(struct\ cdrom_device_info *, struct\ packet_command *{});\cr -\};\cr -} -$$ -When a low-level device driver implements one of these capabilities, -it should add a function pointer to this $struct$. When a particular -function is not implemented, however, this $struct$ should contain a -NULL instead. The $capability$ flags specify the capabilities of the -\cdrom\ hardware and/or low-level \cdrom\ driver when a \cdrom\ drive -is registered with the \UCD. - -Note that most functions have fewer parameters than their -$blkdev_fops$ counterparts. This is because very little of the -information in the structures $inode$ and $file$ is used. For most -drivers, the main parameter is the $struct$ $cdrom_device_info$, from -which the major and minor number can be extracted. 
(Most low-level -\cdrom\ drivers don't even look at the major and minor number though, -since many of them only support one device.) This will be available -through $dev$ in $cdrom_device_info$ described below. - -The drive-specific, minor-like information that is registered with -\cdromc, currently contains the following fields: -$$ -\halign{$#$\ \hfil&$#$\ \hfil&\hbox to 10em{$#$\hss}& - $/*$ \rm# $*/$\hfil\cr -struct& cdrom_device_info\ \{ \hidewidth\cr - & const\ struct\ cdrom_device_ops *& ops;& device operations for this major\cr - & struct\ list_head& list;& linked list of all device_info\cr - & struct\ gendisk *& disk;& matching block layer disk\cr - & void *& handle;& driver-dependent data\cr -\noalign{\medskip} - & int& mask;& mask of capability: disables them \cr - & int& speed;& maximum speed for reading data \cr - & int& capacity;& number of discs in a jukebox \cr -\noalign{\medskip} - &unsigned\ int& options : 30;& options flags \cr - &unsigned& mc_flags : 2;& media-change buffer flags \cr - &unsigned\ int& vfs_events;& cached events for vfs path\cr - &unsigned\ int& ioctl_events;& cached events for ioctl path\cr - & int& use_count;& number of times device is opened\cr - & char& name[20];& name of the device type\cr -\noalign{\medskip} - &__u8& sanyo_slot : 2;& Sanyo 3-CD changer support\cr - &__u8& keeplocked : 1;& CDROM_LOCKDOOR status\cr - &__u8& reserved : 5;& not used yet\cr - & int& cdda_method;& see CDDA_* flags\cr - &__u8& last_sense;& saves last sense key\cr - &__u8& media_written;& dirty flag, DVD+RW bookkeeping\cr - &unsigned\ short& mmc3_profile;& current MMC3 profile\cr - & int& for_data;& unknown:TBD\cr - & int\ (* exit)\ (struct\ cdrom_device_info *);&& unknown:TBD\cr - & int& mrw_mode_page;& which MRW mode page is in use\cr -\}\cr -}$$ -Using this $struct$, a linked list of the registered minor devices is -built, using the $next$ field. 
The device number, the device operations -struct and specifications of properties of the drive are stored in this -structure. - -The $mask$ flags can be used to mask out some of the capabilities listed -in $ops\to capability$, if a specific drive doesn't support a feature -of the driver. The value $speed$ specifies the maximum head-rate of the -drive, measured in units of normal audio speed (176\,kB/sec raw data or -150\,kB/sec file system data). The parameters are declared $const$ -because they describe properties of the drive, which don't change after -registration. - -A few registers contain variables local to the \cdrom\ drive. The -flags $options$ are used to specify how the general \cdrom\ routines -should behave. These various flags registers should provide enough -flexibility to adapt to the different users' wishes (and {\em not\/} the -`arbitrary' wishes of the author of the low-level device driver, as is -the case in the old scheme). The register $mc_flags$ is used to buffer -the information from $media_changed()$ to two separate queues. Other -data that is specific to a minor drive, can be accessed through $handle$, -which can point to a data structure specific to the low-level driver. -The fields $use_count$, $next$, $options$ and $mc_flags$ need not be -initialized. - -The intermediate software layer that \cdromc\ forms will perform some -additional bookkeeping. The use count of the device (the number of -processes that have the device opened) is registered in $use_count$. The -function $cdrom_ioctl()$ will verify the appropriate user-memory regions -for read and write, and in case a location on the CD is transferred, -it will `sanitize' the format by making requests to the low-level -drivers in a standard format, and translating all formats between the -user-software and low level drivers. This relieves much of the drivers' -memory checking and format checking and translation. Also, the necessary -structures will be declared on the program stack. 
- -The implementation of the functions should be as defined in the -following sections. Two functions {\em must\/} be implemented, namely -$open()$ and $release()$. Other functions may be omitted, their -corresponding capability flags will be cleared upon registration. -Generally, a function returns zero on success and negative on error. A -function call should return only after the command has completed, but of -course waiting for the device should not use processor time. - -\subsection{$Int\ open(struct\ cdrom_device_info * cdi, int\ purpose)$} - -$Open()$ should try to open the device for a specific $purpose$, which -can be either: -\begin{itemize} -\item[0] Open for reading data, as done by {\tt {mount()}} (2), or the -user commands {\tt {dd}} or {\tt {cat}}. -\item[1] Open for $ioctl$ commands, as done by audio-CD playing -programs. -\end{itemize} -Notice that any strategic code (closing tray upon $open()$, etc.)\ is -done by the calling routine in \cdromc, so the low-level routine -should only be concerned with proper initialization, such as spinning -up the disc, etc. % and device-use count - - -\subsection{$Void\ release(struct\ cdrom_device_info * cdi)$} - - -Device-specific actions should be taken such as spinning down the device. -However, strategic actions such as ejection of the tray, or unlocking -the door, should be left over to the general routine $cdrom_release()$. -This is the only function returning type $void$. - -\subsection{$Int\ drive_status(struct\ cdrom_device_info * cdi, int\ slot_nr)$} -\label{drive status} - -The function $drive_status$, if implemented, should provide -information on the status of the drive (not the status of the disc, -which may or may not be in the drive). If the drive is not a changer, -$slot_nr$ should be ignored. 
In \cdromh\ the possibilities are listed: -$$ -\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr -CDS_NO_INFO& no information available\cr -CDS_NO_DISC& no disc is inserted, tray is closed\cr -CDS_TRAY_OPEN& tray is opened\cr -CDS_DRIVE_NOT_READY& something is wrong, tray is moving?\cr -CDS_DISC_OK& a disc is loaded and everything is fine\cr -} -$$ - -\subsection{$Int\ media_changed(struct\ cdrom_device_info * cdi, int\ disc_nr)$} - -This function is very similar to the original function in $struct\ -file_operations$. It returns 1 if the medium of the device $cdi\to -dev$ has changed since the last call, and 0 otherwise. The parameter -$disc_nr$ identifies a specific slot in a juke-box, it should be -ignored for single-disc drives. Note that by `re-routing' this -function through $cdrom_media_changed()$, we can implement separate -queues for the VFS and a new $ioctl()$ function that can report device -changes to software (\eg, an auto-mounting daemon). - -\subsection{$Int\ tray_move(struct\ cdrom_device_info * cdi, int\ position)$} - -This function, if implemented, should control the tray movement. (No -other function should control this.) The parameter $position$ controls -the desired direction of movement: -\begin{itemize} -\item[0] Close tray -\item[1] Open tray -\end{itemize} -This function returns 0 upon success, and a non-zero value upon -error. Note that if the tray is already in the desired position, no -action need be taken, and the return value should be 0. - -\subsection{$Int\ lock_door(struct\ cdrom_device_info * cdi, int\ lock)$} - -This function (and no other code) controls locking of the door, if the -drive allows this. The value of $lock$ controls the desired locking -state: -\begin{itemize} -\item[0] Unlock door, manual opening is allowed -\item[1] Lock door, tray cannot be ejected manually -\end{itemize} -This function returns 0 upon success, and a non-zero value upon -error. 
Note that if the door is already in the requested state, no -action need be taken, and the return value should be 0. - -\subsection{$Int\ select_speed(struct\ cdrom_device_info * cdi, int\ speed)$} - -Some \cdrom\ drives are capable of changing their head-speed. There -are several reasons for changing the speed of a \cdrom\ drive. Badly -pressed \cdrom s may benefit from less-than-maximum head rate. Modern -\cdrom\ drives can obtain very high head rates (up to $24\times$ is -common). It has been reported that these drives can make reading -errors at these high speeds, reducing the speed can prevent data loss -in these circumstances. Finally, some of these drives can -make an annoyingly loud noise, which a lower speed may reduce. %Finally, -%although the audio-low-pass filters probably aren't designed for it, -%more than real-time playback of audio might be used for high-speed -%copying of audio tracks. - -This function specifies the speed at which data is read or audio is -played back. The value of $speed$ specifies the head-speed of the -drive, measured in units of standard cdrom speed (176\,kB/sec raw data -or 150\,kB/sec file system data). So to request that a \cdrom\ drive -operate at 300\,kB/sec you would call the CDROM_SELECT_SPEED $ioctl$ -with $speed=2$. The special value `0' means `auto-selection', \ie, -maximum data-rate or real-time audio rate. If the drive doesn't have -this `auto-selection' capability, the decision should be made on the -current disc loaded and the return value should be positive. A negative -return value indicates an error. - -\subsection{$Int\ select_disc(struct\ cdrom_device_info * cdi, int\ number)$} - -If the drive can store multiple discs (a juke-box) this function -will perform disc selection. It should return the number of the -selected disc on success, a negative value on error. Currently, only -the ide-cd driver supports this functionality. 
- -\subsection{$Int\ get_last_session(struct\ cdrom_device_info * cdi, struct\ - cdrom_multisession * ms_info)$} - -This function should implement the old corresponding $ioctl()$. For -device $cdi\to dev$, the start of the last session of the current disc -should be returned in the pointer argument $ms_info$. Note that -routines in \cdromc\ have sanitized this argument: its requested -format will {\em always\/} be of the type $CDROM_LBA$ (linear block -addressing mode), whatever the calling software requested. But -sanitization goes even further: the low-level implementation may -return the requested information in $CDROM_MSF$ format if it wishes so -(setting the $ms_info\rightarrow addr_format$ field appropriately, of -course) and the routines in \cdromc\ will make the transformation if -necessary. The return value is 0 upon success. - -\subsection{$Int\ get_mcn(struct\ cdrom_device_info * cdi, struct\ - cdrom_mcn * mcn)$} - -Some discs carry a `Media Catalog Number' (MCN), also called -`Universal Product Code' (UPC). This number should reflect the number -that is generally found in the bar-code on the product. Unfortunately, -the few discs that carry such a number on the disc don't even use the -same format. The return argument to this function is a pointer to a -pre-declared memory region of type $struct\ cdrom_mcn$. The MCN is -expected as a 13-character string, terminated by a null-character. - -\subsection{$Int\ reset(struct\ cdrom_device_info * cdi)$} - -This call should perform a hard-reset on the drive (although in -circumstances that a hard-reset is necessary, a drive may very well not -listen to commands anymore). Preferably, control is returned to the -caller only after the drive has finished resetting. If the drive is no -longer listening, it may be wise for the underlying low-level cdrom -driver to time out. 
- -\subsection{$Int\ audio_ioctl(struct\ cdrom_device_info * cdi, unsigned\ - int\ cmd, void * arg)$} - -Some of the \cdrom-$ioctl$s defined in \cdromh\ can be -implemented by the routines described above, and hence the function -$cdrom_ioctl$ will use those. However, most $ioctl$s deal with -audio-control. We have decided to leave these to be accessed through a -single function, repeating the arguments $cmd$ and $arg$. Note that -the latter is of type $void*{}$, rather than $unsigned\ long\ -int$. The routine $cdrom_ioctl()$ does do some useful things, -though. It sanitizes the address format type to $CDROM_MSF$ (Minutes, -Seconds, Frames) for all audio calls. It also verifies the memory -location of $arg$, and reserves stack-memory for the argument. This -makes implementation of the $audio_ioctl()$ much simpler than in the -old driver scheme. For example, you may look up the function -$cm206_audio_ioctl()$ in {\tt {cm206.c}} that should be updated with -this documentation. - -An unimplemented ioctl should return $-ENOSYS$, but a harmless request -(\eg, $CDROMSTART$) may be ignored by returning 0 (success). Other -errors should be according to the standards, whatever they are. When -an error is returned by the low-level driver, the \UCD\ tries whenever -possible to return the error code to the calling program. (We may decide -to sanitize the return value in $cdrom_ioctl()$ though, in order to -guarantee a uniform interface to the audio-player software.) - -\subsection{$Int\ dev_ioctl(struct\ cdrom_device_info * cdi, unsigned\ int\ - cmd, unsigned\ long\ arg)$} - -Some $ioctl$s seem to be specific to certain \cdrom\ drives. That is, -they are introduced to service some capabilities of certain drives. In -fact, there are 6 different $ioctl$s for reading data, either in some -particular kind of format, or audio data. Not many drives support -reading audio tracks as data, I believe this is because of protection -of copyrights of artists. 
Moreover, I think that if audio-tracks are -supported, it should be done through the VFS and not via $ioctl$s. A -problem here could be the fact that audio-frames are 2352 bytes long, -so either the audio-file-system should ask for 75264 bytes at once -(the least common multiple of 512 and 2352), or the drivers should -bend their backs to cope with this incoherence (to which I would be -opposed). Furthermore, it is very difficult for the hardware to find -the exact frame boundaries, since there are no synchronization headers -in audio frames. Once these issues are resolved, this code should be -standardized in \cdromc. - -Because there are so many $ioctl$s that seem to be introduced to -satisfy certain drivers,\footnote{Is there software around that - actually uses these? I'd be interested!} any `non-standard' $ioctl$s -are routed through the call $dev_ioctl()$. In principle, `private' -$ioctl$s should be numbered after the device's major number, and not -the general \cdrom\ $ioctl$ number, {\tt {0x53}}. Currently the -non-supported $ioctl$s are: {\it CDROMREADMODE1, CDROMREADMODE2, - CDROMREADAUDIO, CDROMREADRAW, CDROMREADCOOKED, CDROMSEEK, - CDROMPLAY\-BLK and CDROM\-READALL}. - - -\subsection{\cdrom\ capabilities} -\label{capability} - -Instead of just implementing some $ioctl$ calls, the interface in -\cdromc\ supplies the possibility to indicate the {\em capabilities\/} -of a \cdrom\ drive. This can be done by ORing any number of -capability-constants that are defined in \cdromh\ at the registration -phase. 
Currently, the capabilities are any of: -$$ -\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr -CDC_CLOSE_TRAY& can close tray by software control\cr -CDC_OPEN_TRAY& can open tray\cr -CDC_LOCK& can lock and unlock the door\cr -CDC_SELECT_SPEED& can select speed, in units of $\sim$150\,kB/s\cr -CDC_SELECT_DISC& drive is juke-box\cr -CDC_MULTI_SESSION& can read sessions $>\rm1$\cr -CDC_MCN& can read Media Catalog Number\cr -CDC_MEDIA_CHANGED& can report if disc has changed\cr -CDC_PLAY_AUDIO& can perform audio-functions (play, pause, etc)\cr -CDC_RESET& hard reset device\cr -CDC_IOCTLS& driver has non-standard ioctls\cr -CDC_DRIVE_STATUS& driver implements drive status\cr -} -$$ -The capability flag is declared $const$, to prevent drivers from -accidentally tampering with the contents. The capability flags actually -inform \cdromc\ of what the driver can do. If the drive found -by the driver does not have the capability, it can be masked out by -the $cdrom_device_info$ variable $mask$. For instance, the SCSI \cdrom\ -driver has implemented the code for loading and ejecting \cdrom s, and -hence its corresponding flags in $capability$ will be set. But a SCSI -\cdrom\ drive might be a caddy system, which can't load the tray, and -hence for this drive the $cdrom_device_info$ struct will have set -the $CDC_CLOSE_TRAY$ bit in $mask$. - -In the file \cdromc\ you will encounter many constructions of the type -$$\it -if\ (cdo\rightarrow capability \mathrel\& \mathord{\sim} cdi\rightarrow mask - \mathrel{\&} CDC_) \ldots -$$ -There is no $ioctl$ to set the mask\dots The reason is that -I think it is better to control the {\em behavior\/} rather than the -{\em capabilities}. - -\subsection{Options} - -A final flag register controls the {\em behavior\/} of the \cdrom\ -drives, in order to satisfy different users' wishes, hopefully -independently of the ideas of the respective author who happened to -have made the drive's support available to the \linux\ community.
The -current behavior options are: -$$ -\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr -CDO_AUTO_CLOSE& try to close tray upon device $open()$\cr -CDO_AUTO_EJECT& try to open tray on last device $close()$\cr -CDO_USE_FFLAGS& use $file_pointer\rightarrow f_flags$ to indicate - purpose for $open()$\cr -CDO_LOCK& try to lock door if device is opened\cr -CDO_CHECK_TYPE& ensure disc type is data if opened for data\cr -} -$$ - -The initial value of this register is $CDO_AUTO_CLOSE \mathrel| -CDO_USE_FFLAGS \mathrel| CDO_LOCK$, reflecting my own view on user -interface and software standards. Before you protest, there are two -new $ioctl$s implemented in \cdromc, that allow you to control the -behavior by software. These are: -$$ -\halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr -CDROM_SET_OPTIONS& set options specified in $(int)\ arg$\cr -CDROM_CLEAR_OPTIONS& clear options specified in $(int)\ arg$\cr -} -$$ -One option needs some more explanation: $CDO_USE_FFLAGS$. In the next -newsection we explain what the need for this option is. - -A software package {\tt setcd}, available from the Debian distribution -and {\tt sunsite.unc.edu}, allows user level control of these flags. - -\newsection{The need to know the purpose of opening the \cdrom\ device} - -Traditionally, Unix devices can be used in two different `modes', -either by reading/writing to the device file, or by issuing -controlling commands to the device, by the device's $ioctl()$ -call. The problem with \cdrom\ drives, is that they can be used for -two entirely different purposes. One is to mount removable -file systems, \cdrom s, the other is to play audio CD's. Audio commands -are implemented entirely through $ioctl$s, presumably because the -first implementation (SUN?) has been such. In principle there is -nothing wrong with this, but a good control of the `CD player' demands -that the device can {\em always\/} be opened in order to give the -$ioctl$ commands, regardless of the state the drive is in. 
- -On the other hand, when used as a removable-media disc drive (what the -original purpose of \cdrom s is) we would like to make sure that the -disc drive is ready for operation upon opening the device. In the old -scheme, some \cdrom\ drivers don't do any integrity checking, resulting -in a number of i/o errors reported by the VFS to the kernel when an -attempt for mounting a \cdrom\ on an empty drive occurs. This is not a -particularly elegant way to find out that there is no \cdrom\ inserted; -it more-or-less looks like the old IBM-PC trying to read an empty floppy -drive for a couple of seconds, after which the system complains it -can't read from it. Nowadays we can {\em sense\/} the existence of a -removable medium in a drive, and we believe we should exploit that -fact. An integrity check on opening of the device, that verifies the -availability of a \cdrom\ and its correct type (data), would be -desirable. - -These two ways of using a \cdrom\ drive, principally for data and -secondarily for playing audio discs, have different demands for the -behavior of the $open()$ call. Audio use simply wants to open the -device in order to get a file handle which is needed for issuing -$ioctl$ commands, while data use wants to open for correct and -reliable data transfer. The only way user programs can indicate what -their {\em purpose\/} of opening the device is, is through the $flags$ -parameter (see {\tt {open(2)}}). For \cdrom\ devices, these flags aren't -implemented (some drivers implement checking for write-related flags, -but this is not strictly necessary if the device file has correct -permission flags). Most option flags simply don't make sense to -\cdrom\ devices: $O_CREAT$, $O_NOCTTY$, $O_TRUNC$, $O_APPEND$, and -$O_SYNC$ have no meaning to a \cdrom. - -We therefore propose to use the flag $O_NONBLOCK$ to indicate -that the device is opened just for issuing $ioctl$ -commands. 
Strictly, the meaning of $O_NONBLOCK$ is that opening and -subsequent calls to the device don't cause the calling process to -wait. We could interpret this as ``don't wait until someone has -inserted some valid data-\cdrom.'' Thus, our proposal of the -implementation for the $open()$ call for \cdrom s is: -\begin{itemize} -\item If no other flags are set than $O_RDONLY$, the device is opened -for data transfer, and the return value will be 0 only upon successful -initialization of the transfer. The call may even induce some actions -on the \cdrom, such as closing the tray. -\item If the option flag $O_NONBLOCK$ is set, opening will always be -successful, unless the whole device doesn't exist. The drive will take -no actions whatsoever. -\end{itemize} - -\subsection{And what about standards?} - -You might hesitate to accept this proposal as it comes from the -\linux\ community, and not from some standardizing institute. What -about SUN, SGI, HP and all those other Unix and hardware vendors? -Well, these companies are in the lucky position that they generally -control both the hardware and software of their supported products, -and are large enough to set their own standard. They do not have to -deal with a dozen or more different, competing hardware -configurations.\footnote{Incidentally, I think that SUN's approach to -mounting \cdrom s is very good in origin: under Solaris a -volume-daemon automatically mounts a newly inserted \cdrom\ under {\tt -{/cdrom/$$/}}. In my opinion they should have pushed this -further and have {\em every\/} \cdrom\ on the local area network be -mounted at the similar location, \ie, no matter in which particular -machine you insert a \cdrom, it will always appear at the same -position in the directory tree, on every system. 
When I wanted to -implement such a user-program for \linux, I came across the -differences in behavior of the various drivers, and the need for an -$ioctl$ informing about media changes.} - -We believe that using $O_NONBLOCK$ to indicate that a device is being opened -for $ioctl$ commands only can be easily introduced in the \linux\ -community. All the CD-player authors will have to be informed, we can -even send in our own patches to the programs. The use of $O_NONBLOCK$ -has most likely no influence on the behavior of the CD-players on -other operating systems than \linux. Finally, a user can always revert -to old behavior by a call to $ioctl(file_descriptor, CDROM_CLEAR_OPTIONS, -CDO_USE_FFLAGS)$. - -\subsection{The preferred strategy of $open()$} - -The routines in \cdromc\ are designed in such a way that run-time -configuration of the behavior of \cdrom\ devices (of {\em any\/} type) -can be carried out, by the $CDROM_SET/CLEAR_OPTIONS$ $ioctls$. Thus, various -modes of operation can be set: -\begin{description} -\item[$CDO_AUTO_CLOSE \mathrel| CDO_USE_FFLAGS \mathrel| CDO_LOCK$] This -is the default setting. (With $CDO_CHECK_TYPE$ it will be better, in the -future.) If the device is not yet opened by any other process, and if -the device is being opened for data ($O_NONBLOCK$ is not set) and the -tray is found to be open, an attempt to close the tray is made. Then, -it is verified that a disc is in the drive and, if $CDO_CHECK_TYPE$ is -set, that it contains tracks of type `data mode 1.' Only if all tests -are passed is the return value zero. The door is locked to prevent file -system corruption. If the drive is opened for audio ($O_NONBLOCK$ is -set), no actions are taken and a value of 0 will be returned. -\item[$CDO_AUTO_CLOSE \mathrel| CDO_AUTO_EJECT \mathrel| CDO_LOCK$] This -mimics the behavior of the current sbpcd-driver. The option flags are -ignored, the tray is closed on the first open, if necessary. 
Similarly, -the tray is opened on the last release, \ie, if a \cdrom\ is unmounted, -it is automatically ejected, such that the user can replace it. -\end{description} -We hope that these options can convince everybody (both driver -maintainers and user program developers) to adopt the new \cdrom\ -driver scheme and option flag interpretation. - -\newsection{Description of routines in \cdromc} - -Only a few routines in \cdromc\ are exported to the drivers. In this -new section we will discuss these, as well as the functions that `take -over' the \cdrom\ interface to the kernel. The header file belonging -to \cdromc\ is called \cdromh. Formerly, some of the contents of this -file were placed in the file {\tt {ucdrom.h}}, but this file has now been -merged back into \cdromh. - -\subsection{$Struct\ file_operations\ cdrom_fops$} - -The contents of this structure were described in section~\ref{cdrom.c}. -A pointer to this structure is assigned to the $fops$ field -of the $struct gendisk$. - -\subsection{$Int\ register_cdrom( struct\ cdrom_device_info\ * cdi)$} - -This function is used in about the same way one registers $cdrom_fops$ -with the kernel, the device operations and information structures, -as described in section~\ref{cdrom.c}, should be registered with the -\UCD: -$$ -register_cdrom(\&_info); -$$ -This function returns zero upon success, and non-zero upon -failure. The structure $_info$ should have a pointer to the -driver's $_dops$, as in -$$ -\vbox{\halign{&$#$\hfil\cr -struct\ &cdrom_device_info\ _info = \{\cr -& _dops;\cr -&\ldots\cr -\}\cr -}}$$ -Note that a driver must have one static structure, $_dops$, while -it may have as many structures $_info$ as there are minor devices -active. $Register_cdrom()$ builds a linked list from these. - -\subsection{$Void\ unregister_cdrom(struct\ cdrom_device_info * cdi)$} - -Unregistering device $cdi$ with minor number $MINOR(cdi\to dev)$ removes -the minor device from the list.
If it was the last registered minor for -the low-level driver, this disconnects the registered device-operation -routines from the \cdrom\ interface. This function returns zero upon -success, and non-zero upon failure. - -\subsection{$Int\ cdrom_open(struct\ inode * ip, struct\ file * fp)$} - -This function is not called directly by the low-level drivers; it is -listed in the standard $cdrom_fops$. If the VFS opens a file, this -function becomes active. A strategy is implemented in this routine, -taking care of all capabilities and options that are set in the -$cdrom_device_ops$ connected to the device. Then, the program flow is -transferred to the device-dependent $open()$ call. - -\subsection{$Void\ cdrom_release(struct\ inode *ip, struct\ file -*fp)$} - -This function implements the reverse-logic of $cdrom_open()$, and then -calls the device-dependent $release()$ routine. When the use-count has -reached 0, the allocated buffers are flushed by calls to $sync_dev(dev)$ -and $invalidate_buffers(dev)$. - - -\subsection{$Int\ cdrom_ioctl(struct\ inode *ip, struct\ file *fp, -unsigned\ int\ cmd, unsigned\ long\ arg)$} -\label{cdrom-ioctl} - -This function handles all the standard $ioctl$ requests for \cdrom\ -devices in a uniform way. The different calls fall into three -categories: $ioctl$s that can be directly implemented by device -operations, ones that are routed through the call $audio_ioctl()$, and -the remaining ones, that are presumably device-dependent. Generally, a -negative return value indicates an error. - -\subsubsection{Directly implemented $ioctl$s} -\label{ioctl-direct} - -The following `old' \cdrom-$ioctl$s are implemented by directly -calling device-operations in $cdrom_device_ops$, if implemented and -not masked: -\begin{description} -\item[CDROMMULTISESSION] Requests the last session on a \cdrom. -\item[CDROMEJECT] Open tray. -\item[CDROMCLOSETRAY] Close tray.
-\item[CDROMEJECT_SW] If $arg\not=0$, set behavior to auto-close (close -tray on first open) and auto-eject (eject on last release), otherwise -set behavior to non-moving on $open()$ and $release()$ calls. -\item[CDROM_GET_MCN] Get the Media Catalog Number from a CD. -\end{description} - -\subsubsection{$Ioctl$s routed through $audio_ioctl()$} -\label{ioctl-audio} - -The following set of $ioctl$s are all implemented through a call to -the $cdrom_fops$ function $audio_ioctl()$. Memory checks and -allocation are performed in $cdrom_ioctl()$, and also sanitization of -address format ($CDROM_LBA$/$CDROM_MSF$) is done. -\begin{description} -\item[CDROMSUBCHNL] Get sub-channel data in argument $arg$ of type $struct\ -cdrom_subchnl *{}$. -\item[CDROMREADTOCHDR] Read Table of Contents header, in $arg$ of type -$struct\ cdrom_tochdr *{}$. -\item[CDROMREADTOCENTRY] Read a Table of Contents entry in $arg$ and -specified by $arg$ of type $struct\ cdrom_tocentry *{}$. -\item[CDROMPLAYMSF] Play audio fragment specified in Minute, Second, -Frame format, delimited by $arg$ of type $struct\ cdrom_msf *{}$. -\item[CDROMPLAYTRKIND] Play audio fragment in track-index format -delimited by $arg$ of type $struct\ \penalty-1000 cdrom_ti *{}$. -\item[CDROMVOLCTRL] Set volume specified by $arg$ of type $struct\ -cdrom_volctrl *{}$. -\item[CDROMVOLREAD] Read volume into $arg$ of type $struct\ -cdrom_volctrl *{}$. -\item[CDROMSTART] Spin up disc. -\item[CDROMSTOP] Stop playback of audio fragment. -\item[CDROMPAUSE] Pause playback of audio fragment. -\item[CDROMRESUME] Resume playing. -\end{description} - -\subsubsection{New $ioctl$s in \cdromc} - -The following $ioctl$s have been introduced to allow user programs to -control the behavior of individual \cdrom\ devices. New $ioctl$ -commands can be identified by the underscores in their names. -\begin{description} -\item[CDROM_SET_OPTIONS] Set options specified by $arg$. Returns the -option flag register after modification.
Use $arg = \rm0$ for reading -the current flags. -\item[CDROM_CLEAR_OPTIONS] Clear options specified by $arg$. Returns - the option flag register after modification. -\item[CDROM_SELECT_SPEED] Select head-rate speed of disc as specified - by $arg$ in units of standard cdrom speed (176\,kB/sec raw data or - 150\,kB/sec file system data). The value 0 means `auto-select', \ie, - play audio discs at real time and data discs at maximum speed. The value - $arg$ is checked against the maximum head rate of the drive found in the - $cdrom_dops$. -\item[CDROM_SELECT_DISC] Select disc numbered $arg$ from a juke-box. - First disc is numbered 0. The number $arg$ is checked against the - maximum number of discs in the juke-box found in the $cdrom_dops$. -\item[CDROM_MEDIA_CHANGED] Returns 1 if a disc has been changed since - the last call. Note that calls to $cdrom_media_changed$ by the VFS - are treated by an independent queue, so both mechanisms will detect - a media change once. For juke-boxes, an extra argument $arg$ - specifies the slot for which the information is given. The special - value $CDSL_CURRENT$ requests that information about the currently - selected slot be returned. -\item[CDROM_DRIVE_STATUS] Returns the status of the drive by a call to - $drive_status()$. Return values are defined in section~\ref{drive - status}. Note that this call doesn't return information on the - current playing activity of the drive; this can be polled through an - $ioctl$ call to $CDROMSUBCHNL$. For juke-boxes, an extra argument - $arg$ specifies the slot for which (possibly limited) information is - given. The special value $CDSL_CURRENT$ requests that information - about the currently selected slot be returned. -\item[CDROM_DISC_STATUS] Returns the type of the disc currently in the - drive. It should be viewed as a complement to $CDROM_DRIVE_STATUS$. - This $ioctl$ can provide \emph {some} information about the current - disc that is inserted in the drive.
This functionality used to be - implemented in the low level drivers, but is now carried out - entirely in \UCD. - - The history of development of the CD's use as a carrier medium for - various digital information has led to many different disc types. - This $ioctl$ is useful only in the case that CDs have \emph {only - one} type of data on them. While this is often the case, it is - also very common for CDs to have some tracks with data, and some - tracks with audio. Because this is an existing interface, rather - than fixing this interface by changing the assumptions it was made - under, thereby breaking all user applications that use this - function, the \UCD\ implements this $ioctl$ as follows: If the CD in - question has audio tracks on it, and it has absolutely no CD-I, XA, - or data tracks on it, it will be reported as $CDS_AUDIO$. If it has - both audio and data tracks, it will return $CDS_MIXED$. If there - are no audio tracks on the disc, and if the CD in question has any - CD-I tracks on it, it will be reported as $CDS_XA_2_2$. Failing - that, if the CD in question has any XA tracks on it, it will be - reported as $CDS_XA_2_1$. Finally, if the CD in question has any - data tracks on it, it will be reported as a data CD ($CDS_DATA_1$). - - This $ioctl$ can return: - $$ - \halign{$#$\ \hfil&$/*$ \rm# $*/$\hfil\cr - CDS_NO_INFO& no information available\cr - CDS_NO_DISC& no disc is inserted, or tray is opened\cr - CDS_AUDIO& Audio disc (2352 audio bytes/frame)\cr - CDS_DATA_1& data disc, mode 1 (2048 user bytes/frame)\cr - CDS_XA_2_1& mixed data (XA), mode 2, form 1 (2048 user bytes)\cr - CDS_XA_2_2& mixed data (XA), mode 2, form 2 (2324 user bytes)\cr - CDS_MIXED& mixed audio/data disc\cr - } - $$ - For some information concerning frame layout of the various disc - types, see a recent version of \cdromh. - -\item[CDROM_CHANGER_NSLOTS] Returns the number of slots in a - juke-box. -\item[CDROMRESET] Reset the drive.
-\item[CDROM_GET_CAPABILITY] Returns the $capability$ flags for the - drive. Refer to section \ref{capability} for more information on - these flags. -\item[CDROM_LOCKDOOR] Locks the door of the drive. $arg == \rm0$ - unlocks the door, any other value locks it. -\item[CDROM_DEBUG] Turns on debugging info. Only root is allowed - to do this. Same semantics as CDROM_LOCKDOOR. -\end{description} - -\subsubsection{Device dependent $ioctl$s} - -Finally, all other $ioctl$s are passed to the function $dev_ioctl()$, -if implemented. No memory allocation or verification is carried out. - -\newsection{How to update your driver} - -\begin{enumerate} -\item Make a backup of your current driver. -\item Get hold of the files \cdromc\ and \cdromh, they should be in - the directory tree that came with this documentation. -\item Make sure you include \cdromh. -\item Change the 3rd argument of $register_blkdev$ from -$\&_fops$ to $\&cdrom_fops$. -\item Just after that line, add the following to register with the \UCD: - $$register_cdrom(\&_info);$$ - Similarly, add a call to $unregister_cdrom()$ at the appropriate place. -\item Copy an example of the device-operations $struct$ to your - source, \eg, from {\tt {cm206.c}} $cm206_dops$, and change all - entries to names corresponding to your driver, or names you just - happen to like. If your driver doesn't support a certain function, - make the entry $NULL$. At the entry $capability$ you should list all - capabilities your driver currently supports. If your driver - has a capability that is not listed, please send me a message. -\item Copy the $cdrom_device_info$ declaration from the same example - driver, and modify the entries according to your needs. If your - driver dynamically determines the capabilities of the hardware, this - structure should also be declared dynamically. -\item Implement all functions in your $_dops$ structure, - according to prototypes listed in \cdromh, and specifications given - in section~\ref{cdrom.c}. 
Most likely you have already implemented - the code in a large part, and you will almost certainly need to adapt the - prototype and return values. -\item Rename your $_ioctl()$ function to $audio_ioctl$ and - change the prototype a little. Remove entries listed in the first - part in section~\ref{cdrom-ioctl}, if your code was OK, these are - just calls to the routines you adapted in the previous step. -\item You may remove all remaining memory checking code in the - $audio_ioctl()$ function that deals with audio commands (these are - listed in the second part of section~\ref{cdrom-ioctl}). There is no - need for memory allocation either, so most $case$s in the $switch$ - statement look similar to: - $$ - case\ CDROMREADTOCENTRY\colon get_toc_entry\bigl((struct\ - cdrom_tocentry *{})\ arg\bigr); - $$ -\item All remaining $ioctl$ cases must be moved to a separate - function, $_ioctl$, the device-dependent $ioctl$s. Note that - memory checking and allocation must be kept in this code! -\item Change the prototypes of $_open()$ and - $_release()$, and remove any strategic code (\ie, tray - movement, door locking, etc.). -\item Try to recompile the drivers. We advise you to use modules, both - for {\tt {cdrom.o}} and your driver, as debugging is much easier this - way. -\end{enumerate} - -\newsection{Thanks} - -Thanks to all the people involved. First, Erik Andersen, who has -taken over the torch in maintaining \cdromc\ and integrating much -\cdrom-related code in the 2.1-kernel. Thanks to Scott Snyder and -Gerd Knorr, who were the first to implement this interface for SCSI -and IDE-CD drivers and added many ideas for extension of the data -structures relative to kernel~2.0. Further thanks to Heiko Ei{\ss}feldt, -Thomas Quinot, Jon Tombs, Ken Pizzini, Eberhard M\"onkeberg and Andrew -Kroll, the \linux\ \cdrom\ device driver developers who were kind -enough to give suggestions and criticisms during the writing. 
Finally -of course, I want to thank Linus Torvalds for making this possible in -the first place. - -\vfill -$ \version\ $ -\eject -\end{document} diff --git a/Documentation/cdrom/cdrom-standard.txt b/Documentation/cdrom/cdrom-standard.txt new file mode 100644 index 000000000000..dde4f7f7fdbf --- /dev/null +++ b/Documentation/cdrom/cdrom-standard.txt @@ -0,0 +1,1063 @@ +======================= +A Linux CD-ROM standard +======================= + +:Author: David van Leeuwen +:Date: 12 March 1999 +:Updated by: Erik Andersen (andersee@debian.org) +:Updated by: Jens Axboe (axboe@image.dk) + + +Introduction +============ + +Linux is probably the Unix-like operating system that supports +the widest variety of hardware devices. The reasons for this are +presumably + +- The large list of hardware devices available for the many platforms + that Linux now supports (i.e., i386-PCs, Sparc Suns, etc.) +- The open design of the operating system, such that anybody can write a + driver for Linux. +- There is plenty of source code around as examples of how to write a driver. + +The openness of Linux, and the many different types of available +hardware has allowed Linux to support many different hardware devices. +Unfortunately, the very openness that has allowed Linux to support +all these different devices has also allowed the behavior of each +device driver to differ significantly from one device to another. +This divergence of behavior has been very significant for CD-ROM +devices; the way a particular drive reacts to a `standard` *ioctl()* +call varies greatly from one device driver to another. To avoid making +their drivers totally inconsistent, the writers of Linux CD-ROM +drivers generally created new device drivers by understanding, copying, +and then changing an existing one. Unfortunately, this practice did not +maintain uniform behavior across all the Linux CD-ROM drivers. 
+ +This document describes an effort to establish Uniform behavior across +all the different CD-ROM device drivers for Linux. This document also +defines the various *ioctl()'s*, and how the low-level CD-ROM device +drivers should implement them. Currently (as of the Linux 2.1.\ *x* +development kernels) several low-level CD-ROM device drivers, including +both IDE/ATAPI and SCSI, now use this Uniform interface. + +When the CD-ROM was developed, the interface between the CD-ROM drive +and the computer was not specified in the standards. As a result, many +different CD-ROM interfaces were developed. Some of them had their +own proprietary design (Sony, Mitsumi, Panasonic, Philips), other +manufacturers adopted an existing electrical interface and changed +the functionality (CreativeLabs/SoundBlaster, Teac, Funai) or simply +adapted their drives to one or more of the already existing electrical +interfaces (Aztech, Sanyo, Funai, Vertos, Longshine, Optics Storage and +most of the `NoName` manufacturers). In cases where a new drive really +brought its own interface or used its own command set and flow control +scheme, either a separate driver had to be written, or an existing +driver had to be enhanced. History has delivered us CD-ROM support for +many of these different interfaces. Nowadays, almost all new CD-ROM +drives are either IDE/ATAPI or SCSI, and it is very unlikely that any +manufacturer will create a new interface. Even finding drives for the +old proprietary interfaces is getting difficult. + +When (in the 1.3.70's) I looked at the existing software interface, +which was expressed through `cdrom.h`, it appeared to be a rather wild +set of commands and data formats [#f1]_. It seemed that many +features of the software interface had been added to accommodate the +capabilities of a particular drive, in an *ad hoc* manner. More +importantly, it appeared that the behavior of the `standard` commands +was different for most of the different drivers: e. 
g., some drivers +close the tray if an *open()* call occurs when the tray is open, while +others do not. Some drivers lock the door upon opening the device, to +prevent an incoherent file system, but others don't, to allow software +ejection. Undoubtedly, the capabilities of the different drives vary, +but even when two drives have the same capability their drivers' +behavior was usually different. + +.. [#f1] + I cannot recollect what kernel version I looked at, then, + presumably 1.2.13 and 1.3.34 --- the latest kernel that I was + indirectly involved in. + +I decided to start a discussion on how to make all the Linux CD-ROM +drivers behave more uniformly. I began by contacting the developers of +the many CD-ROM drivers found in the Linux kernel. Their reactions +encouraged me to write the Uniform CD-ROM Driver which this document is +intended to describe. The implementation of the Uniform CD-ROM Driver is +in the file `cdrom.c`. This driver is intended to be an additional software +layer that sits on top of the low-level device drivers for each CD-ROM drive. +By adding this additional layer, it is possible to have all the different +CD-ROM devices behave **exactly** the same (insofar as the underlying +hardware will allow). + +The goal of the Uniform CD-ROM Driver is **not** to alienate driver developers +who have not yet taken steps to support this effort. The goal of the Uniform CD-ROM +Driver is simply to give people writing application programs for CD-ROM drives +**one** Linux CD-ROM interface with consistent behavior for all +CD-ROM devices. In addition, this also provides a consistent interface +between the low-level device driver code and the Linux kernel. Care +is taken that 100% compatibility exists with the data structures and +programmer's interface defined in `cdrom.h`. This guide was written to +help CD-ROM driver developers adapt their code to use the Uniform CD-ROM +Driver code defined in `cdrom.c`.
+ +Personally, I think that the most important hardware interfaces are +the IDE/ATAPI drives and, of course, the SCSI drives, but as prices +of hardware drop continuously, it is also likely that people may have +more than one CD-ROM drive, possibly of mixed types. It is important +that these drives behave in the same way. In December 1994, one of the +cheapest CD-ROM drives was a Philips cm206, a double-speed proprietary +drive. In the months that I was busy writing a Linux driver for it, +proprietary drives became obsolete and IDE/ATAPI drives became the +standard. At the time of the last update to this document (November +1997) it is becoming difficult to even **find** anything less than a +16 speed CD-ROM drive, and 24 speed drives are common. + +.. _cdrom_api: + +Standardizing through another software level +============================================ + +At the time this document was conceived, all drivers directly +implemented the CD-ROM *ioctl()* calls through their own routines. This +led to the danger of different drivers forgetting to do important things +like checking that the user was giving the driver valid data. More +importantly, this led to the divergence of behavior, which has already +been discussed. + +For this reason, the Uniform CD-ROM Driver was created to enforce consistent +CD-ROM drive behavior, and to provide a common set of services to the various +low-level CD-ROM device drivers. The Uniform CD-ROM Driver now provides another +software-level, that separates the *ioctl()* and *open()* implementation +from the actual hardware implementation. Note that this effort has +made few changes which will affect a user's application programs. The +greatest change involved moving the contents of the various low-level +CD-ROM drivers\' header files to the kernel's cdrom directory. This was +done to help ensure that the user is only presented with only one cdrom +interface, the interface defined in `cdrom.h`. + +CD-ROM drives are specific enough (i. 
e., different from other +block-devices such as floppy or hard disc drives), to define a set +of common **CD-ROM device operations**, *_dops*. +These operations are different from the classical block-device file +operations, *_fops*. + +The routines for the Uniform CD-ROM Driver interface level are implemented +in the file `cdrom.c`. In this file, the Uniform CD-ROM Driver interfaces +with the kernel as a block device by registering the following general +*struct file_operations*:: + + struct file_operations cdrom_fops = { + NULL, /* lseek */ + block_read, /* read: general block-dev read */ + block_write, /* write: general block-dev write */ + NULL, /* readdir */ + NULL, /* select */ + cdrom_ioctl, /* ioctl */ + NULL, /* mmap */ + cdrom_open, /* open */ + cdrom_release, /* release */ + NULL, /* fsync */ + NULL, /* fasync */ + cdrom_media_changed, /* media change */ + NULL /* revalidate */ + }; + +Every active CD-ROM device shares this *struct*. The routines +declared above are all implemented in `cdrom.c`, since this file is the +place where the behavior of all CD-ROM-devices is defined and +standardized. The actual interface to the various types of CD-ROM +hardware is still performed by various low-level CD-ROM-device +drivers. These routines simply implement certain **capabilities** +that are common to all CD-ROM (and really, all removable-media +devices). + +Registration of a low-level CD-ROM device driver is now done through +the general routines in `cdrom.c`, not through the Virtual File System +(VFS) any more. The interface implemented in `cdrom.c` is carried out +through two general structures that contain information about the +capabilities of the driver, and the specific drives on which the +driver operates. The structures are: + +cdrom_device_ops + This structure contains information about the low-level driver for a + CD-ROM device. 
This structure is conceptually connected to the major + number of the device (although some drivers may have different + major numbers, as is the case for the IDE driver). + +cdrom_device_info + This structure contains information about a particular CD-ROM drive, + such as its device name, speed, etc. This structure is conceptually + connected to the minor number of the device. + +Registering a particular CD-ROM drive with the Uniform CD-ROM Driver +is done by the low-level device driver through a call to:: + + register_cdrom(struct cdrom_device_info * _info) + +The device information structure, *_info*, contains all the +information needed for the kernel to interface with the low-level +CD-ROM device driver. One of the most important entries in this +structure is a pointer to the *cdrom_device_ops* structure of the +low-level driver. + +The device operations structure, *cdrom_device_ops*, contains a list +of pointers to the functions which are implemented in the low-level +device driver. When `cdrom.c` accesses a CD-ROM device, it does it +through the functions in this structure. It is impossible to know all +the capabilities of future CD-ROM drives, so it is expected that this +list may need to be expanded from time to time as new technologies are +developed. For example, CD-R and CD-R/W drives are beginning to become +popular, and support will soon need to be added for them. 
For now, the +current *struct* is:: + + struct cdrom_device_ops { + int (*open)(struct cdrom_device_info *, int); + void (*release)(struct cdrom_device_info *); + int (*drive_status)(struct cdrom_device_info *, int); + unsigned int (*check_events)(struct cdrom_device_info *, + unsigned int, int); + int (*media_changed)(struct cdrom_device_info *, int); + int (*tray_move)(struct cdrom_device_info *, int); + int (*lock_door)(struct cdrom_device_info *, int); + int (*select_speed)(struct cdrom_device_info *, int); + int (*select_disc)(struct cdrom_device_info *, int); + int (*get_last_session)(struct cdrom_device_info *, + struct cdrom_multisession *); + int (*get_mcn)(struct cdrom_device_info *, struct cdrom_mcn *); + int (*reset)(struct cdrom_device_info *); + int (*audio_ioctl)(struct cdrom_device_info *, + unsigned int, void *); + const int capability; /* capability flags */ + int (*generic_packet)(struct cdrom_device_info *, + struct packet_command *); + }; + +When a low-level device driver implements one of these capabilities, +it should add a function pointer to this *struct*. When a particular +function is not implemented, however, this *struct* should contain a +NULL instead. The *capability* flags specify the capabilities of the +CD-ROM hardware and/or low-level CD-ROM driver when a CD-ROM drive +is registered with the Uniform CD-ROM Driver. + +Note that most functions have fewer parameters than their +*blkdev_fops* counterparts. This is because very little of the +information in the structures *inode* and *file* is used. For most +drivers, the main parameter is the *struct* *cdrom_device_info*, from +which the major and minor number can be extracted. (Most low-level +CD-ROM drivers don't even look at the major and minor number though, +since many of them only support one device.) This will be available +through *dev* in *cdrom_device_info* described below. 
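As a rough illustration of the NULL convention for unimplemented operations, the following user-space sketch mimics a small subset of such an operations table. It is not kernel code: the names `toy_device_info`, `toy_device_ops`, `toy_tray_move` and the `toy_call_*` helpers are hypothetical, and returning `-ENOSYS` when a pointer is NULL is an assumption made for this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct cdrom_device_info. */
struct toy_device_info {
    int tray_open;              /* 1 if the toy tray is open */
};

/* Hypothetical subset of an operations table; NULL means "not implemented". */
struct toy_device_ops {
    int (*tray_move)(struct toy_device_info *, int);
    int (*lock_door)(struct toy_device_info *, int);
};

/* Toy implementation: position 0 closes the tray, 1 opens it. */
static int toy_tray_move(struct toy_device_info *cdi, int position)
{
    cdi->tray_open = position ? 1 : 0;
    return 0;
}

/* This toy driver implements tray_move but leaves lock_door as NULL. */
static const struct toy_device_ops toy_ops = {
    .tray_move = toy_tray_move,
    .lock_door = NULL,
};

/* Call through the table only when a function pointer is present. */
static int toy_call_tray_move(const struct toy_device_ops *ops,
                              struct toy_device_info *cdi, int position)
{
    if (ops->tray_move == NULL)
        return -ENOSYS;         /* assumption: error code chosen for the sketch */
    return ops->tray_move(cdi, position);
}

static int toy_call_lock_door(const struct toy_device_ops *ops,
                              struct toy_device_info *cdi, int lock)
{
    if (ops->lock_door == NULL)
        return -ENOSYS;         /* assumption: error code chosen for the sketch */
    return ops->lock_door(cdi, lock);
}
```

With this table, `toy_call_tray_move(&toy_ops, &cdi, 1)` opens the toy tray, while `toy_call_lock_door()` reports that the operation is missing instead of dereferencing a NULL pointer.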
+ +The drive-specific, minor-like information that is registered with +`cdrom.c`, currently contains the following fields:: + + struct cdrom_device_info { + const struct cdrom_device_ops * ops; /* device operations for this major */ + struct list_head list; /* linked list of all device_info */ + struct gendisk * disk; /* matching block layer disk */ + void * handle; /* driver-dependent data */ + + int mask; /* mask of capability: disables them */ + int speed; /* maximum speed for reading data */ + int capacity; /* number of discs in a jukebox */ + + unsigned int options:30; /* options flags */ + unsigned mc_flags:2; /* media-change buffer flags */ + unsigned int vfs_events; /* cached events for vfs path */ + unsigned int ioctl_events; /* cached events for ioctl path */ + int use_count; /* number of times device is opened */ + char name[20]; /* name of the device type */ + + __u8 sanyo_slot : 2; /* Sanyo 3-CD changer support */ + __u8 keeplocked : 1; /* CDROM_LOCKDOOR status */ + __u8 reserved : 5; /* not used yet */ + int cdda_method; /* see CDDA_* flags */ + __u8 last_sense; /* saves last sense key */ + __u8 media_written; /* dirty flag, DVD+RW bookkeeping */ + unsigned short mmc3_profile; /* current MMC3 profile */ + int for_data; /* unknown:TBD */ + int (*exit)(struct cdrom_device_info *);/* unknown:TBD */ + int mrw_mode_page; /* which MRW mode page is in use */ + }; + +Using this *struct*, a linked list of the registered minor devices is +built, using the *next* field. The device number, the device operations +struct and specifications of properties of the drive are stored in this +structure. + +The *mask* flags can be used to mask out some of the capabilities listed +in *ops->capability*, if a specific drive doesn't support a feature +of the driver. The value *speed* specifies the maximum head-rate of the +drive, measured in units of normal audio speed (176kB/sec raw data or +150kB/sec file system data). 
The parameters are declared *const* +because they describe properties of the drive, which don't change after +registration. + +A few registers contain variables local to the CD-ROM drive. The +flags *options* are used to specify how the general CD-ROM routines +should behave. These various flags registers should provide enough +flexibility to adapt to the different users' wishes (and **not** the +`arbitrary` wishes of the author of the low-level device driver, as is +the case in the old scheme). The register *mc_flags* is used to buffer +the information from *media_changed()* to two separate queues. Other +data that is specific to a minor drive, can be accessed through *handle*, +which can point to a data structure specific to the low-level driver. +The fields *use_count*, *next*, *options* and *mc_flags* need not be +initialized. + +The intermediate software layer that `cdrom.c` forms will perform some +additional bookkeeping. The use count of the device (the number of +processes that have the device opened) is registered in *use_count*. The +function *cdrom_ioctl()* will verify the appropriate user-memory regions +for read and write, and in case a location on the CD is transferred, +it will `sanitize` the format by making requests to the low-level +drivers in a standard format, and translating all formats between the +user-software and low level drivers. This relieves much of the drivers' +memory checking and format checking and translation. Also, the necessary +structures will be declared on the program stack. + +The implementation of the functions should be as defined in the +following sections. Two functions **must** be implemented, namely +*open()* and *release()*. Other functions may be omitted, their +corresponding capability flags will be cleared upon registration. +Generally, a function returns zero on success and negative on error. 
A +function call should return only after the command has completed, but of +course waiting for the device should not use processor time. + +:: + + int open(struct cdrom_device_info *cdi, int purpose) + +*Open()* should try to open the device for a specific *purpose*, which +can be either: + +- Open for reading data, as done by `mount(2)`, or the + user commands `dd` or `cat`. +- Open for *ioctl* commands, as done by audio-CD playing programs. + +Notice that any strategic code (closing tray upon *open()*, etc.) is +done by the calling routine in `cdrom.c`, so the low-level routine +should only be concerned with proper initialization, such as spinning +up the disc, etc. + +:: + + void release(struct cdrom_device_info *cdi) + +Device-specific actions should be taken such as spinning down the device. +However, strategic actions such as ejection of the tray, or unlocking +the door, should be left to the general routine *cdrom_release()*. +This is the only function returning type *void*. + +.. _cdrom_drive_status: + +:: + + int drive_status(struct cdrom_device_info *cdi, int slot_nr) + +The function *drive_status*, if implemented, should provide +information on the status of the drive (not the status of the disc, +which may or may not be in the drive). If the drive is not a changer, +*slot_nr* should be ignored. In `cdrom.h` the possibilities are listed:: + + + CDS_NO_INFO /* no information available */ + CDS_NO_DISC /* no disc is inserted, tray is closed */ + CDS_TRAY_OPEN /* tray is opened */ + CDS_DRIVE_NOT_READY /* something is wrong, tray is moving? */ + CDS_DISC_OK /* a disc is loaded and everything is fine */ + +:: + + int media_changed(struct cdrom_device_info *cdi, int disc_nr) + +This function is very similar to the original function in *struct +file_operations*. It returns 1 if the medium of the device *cdi->dev* +has changed since the last call, and 0 otherwise. 
The parameter +*disc_nr* identifies a specific slot in a juke-box, it should be +ignored for single-disc drives. Note that by `re-routing` this +function through *cdrom_media_changed()*, we can implement separate +queues for the VFS and a new *ioctl()* function that can report device +changes to software (e. g., an auto-mounting daemon). + +:: + + int tray_move(struct cdrom_device_info *cdi, int position) + +This function, if implemented, should control the tray movement. (No +other function should control this.) The parameter *position* controls +the desired direction of movement: + +- 0 Close tray +- 1 Open tray + +This function returns 0 upon success, and a non-zero value upon +error. Note that if the tray is already in the desired position, no +action need be taken, and the return value should be 0. + +:: + + int lock_door(struct cdrom_device_info *cdi, int lock) + +This function (and no other code) controls locking of the door, if the +drive allows this. The value of *lock* controls the desired locking +state: + +- 0 Unlock door, manual opening is allowed +- 1 Lock door, tray cannot be ejected manually + +This function returns 0 upon success, and a non-zero value upon +error. Note that if the door is already in the requested state, no +action need be taken, and the return value should be 0. + +:: + + int select_speed(struct cdrom_device_info *cdi, int speed) + +Some CD-ROM drives are capable of changing their head-speed. There +are several reasons for changing the speed of a CD-ROM drive. Badly +pressed CD-ROM s may benefit from less-than-maximum head rate. Modern +CD-ROM drives can obtain very high head rates (up to *24x* is +common). It has been reported that these drives can make reading +errors at these high speeds, reducing the speed can prevent data loss +in these circumstances. Finally, some of these drives can +make an annoyingly loud noise, which a lower speed may reduce. 
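The speed unit used for CD-ROM drives (1x corresponds to about 150 kB/sec of file system data, as stated earlier for the *speed* field) lends itself to a small conversion helper. The sketch below is plain user-space C. The name `toy_speed_factor` is hypothetical, and both the rounding down to a whole factor and the `-1` result for a non-positive rate are assumptions of this sketch, not something `cdrom.c` prescribes.

```c
#include <assert.h>

/* File system data rate of a 1x drive, in kB/sec, as stated in the text. */
#define TOY_KB_PER_SEC_1X 150

/*
 * Convert a desired data rate in kB/sec into a whole speed factor.
 * Returns -1 for a non-positive rate (assumption of this sketch) and
 * never returns less than 1 for a valid rate.
 */
static int toy_speed_factor(int kb_per_sec)
{
    int factor;

    if (kb_per_sec <= 0)
        return -1;
    factor = kb_per_sec / TOY_KB_PER_SEC_1X;   /* rounds down */
    return factor > 0 ? factor : 1;
}
```

A caller wanting roughly 300 kB/sec would obtain a factor of 2 from this helper and pass that value as the *speed* argument.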
+ +This function specifies the speed at which data is read or audio is +played back. The value of *speed* specifies the head-speed of the +drive, measured in units of standard cdrom speed (176kB/sec raw data +or 150kB/sec file system data). So to request that a CD-ROM drive +operate at 300kB/sec you would call the CDROM_SELECT_SPEED *ioctl* +with *speed=2*. The special value `0` means `auto-selection`, i. e., +maximum data-rate or real-time audio rate. If the drive doesn't have +this `auto-selection` capability, the decision should be made on the +current disc loaded and the return value should be positive. A negative +return value indicates an error. + +:: + + int select_disc(struct cdrom_device_info *cdi, int number) + +If the drive can store multiple discs (a juke-box) this function +will perform disc selection. It should return the number of the +selected disc on success, a negative value on error. Currently, only +the ide-cd driver supports this functionality. + +:: + + int get_last_session(struct cdrom_device_info *cdi, + struct cdrom_multisession *ms_info) + +This function should implement the old corresponding *ioctl()*. For +device *cdi->dev*, the start of the last session of the current disc +should be returned in the pointer argument *ms_info*. Note that +routines in `cdrom.c` have sanitized this argument: its requested +format will **always** be of the type *CDROM_LBA* (linear block +addressing mode), whatever the calling software requested. But +sanitization goes even further: the low-level implementation may +return the requested information in *CDROM_MSF* format if it wishes so +(setting the *ms_info->addr_format* field appropriately, of +course) and the routines in `cdrom.c` will make the transformation if +necessary. The return value is 0 upon success. + +:: + + int get_mcn(struct cdrom_device_info *cdi, + struct cdrom_mcn *mcn) + +Some discs carry a `Media Catalog Number` (MCN), also called +`Universal Product Code` (UPC). 
This number should reflect the number +that is generally found in the bar-code on the product. Unfortunately, +the few discs that carry such a number on the disc don't even use the +same format. The return argument to this function is a pointer to a +pre-declared memory region of type *struct cdrom_mcn*. The MCN is +expected as a 13-character string, terminated by a null-character. + +:: + + int reset(struct cdrom_device_info *cdi) + +This call should perform a hard-reset on the drive (although in +circumstances that a hard-reset is necessary, a drive may very well not +listen to commands anymore). Preferably, control is returned to the +caller only after the drive has finished resetting. If the drive is no +longer listening, it may be wise for the underlying low-level cdrom +driver to time out. + +:: + + int audio_ioctl(struct cdrom_device_info *cdi, + unsigned int cmd, void *arg) + +Some of the CD-ROM-\ *ioctl()*\ 's defined in `cdrom.h` can be +implemented by the routines described above, and hence the function +*cdrom_ioctl* will use those. However, most *ioctl()*\ 's deal with +audio-control. We have decided to leave these to be accessed through a +single function, repeating the arguments *cmd* and *arg*. Note that +the latter is of type *void*, rather than *unsigned long int*. +The routine *cdrom_ioctl()* does do some useful things, +though. It sanitizes the address format type to *CDROM_MSF* (Minutes, +Seconds, Frames) for all audio calls. It also verifies the memory +location of *arg*, and reserves stack-memory for the argument. This +makes implementation of the *audio_ioctl()* much simpler than in the +old driver scheme. For example, you may look up the function +*cm206_audio_ioctl()* `cm206.c` that should be updated with +this documentation. + +An unimplemented ioctl should return *-ENOSYS*, but a harmless request +(e. g., *CDROMSTART*) may be ignored by returning 0 (success). Other +errors should be according to the standards, whatever they are. 
When +an error is returned by the low-level driver, the Uniform CD-ROM Driver +tries whenever possible to return the error code to the calling program. +(We may decide to sanitize the return value in *cdrom_ioctl()* though, in +order to guarantee a uniform interface to the audio-player software.) + +:: + + int dev_ioctl(struct cdrom_device_info *cdi, + unsigned int cmd, unsigned long arg) + +Some *ioctl()'s* seem to be specific to certain CD-ROM drives. That is, +they are introduced to service some capabilities of certain drives. In +fact, there are 6 different *ioctl()'s* for reading data, either in some +particular kind of format, or audio data. Not many drives support +reading audio tracks as data, I believe this is because of protection +of copyrights of artists. Moreover, I think that if audio-tracks are +supported, it should be done through the VFS and not via *ioctl()'s*. A +problem here could be the fact that audio-frames are 2352 bytes long, +so either the audio-file-system should ask for 75264 bytes at once +(the least common multiple of 512 and 2352), or the drivers should +bend their backs to cope with this incoherence (to which I would be +opposed). Furthermore, it is very difficult for the hardware to find +the exact frame boundaries, since there are no synchronization headers +in audio frames. Once these issues are resolved, this code should be +standardized in `cdrom.c`. + +Because there are so many *ioctl()'s* that seem to be introduced to +satisfy certain drivers [#f2]_, any non-standard *ioctl()*\ s +are routed through the call *dev_ioctl()*. In principle, `private` +*ioctl()*\ 's should be numbered after the device's major number, and not +the general CD-ROM *ioctl* number, `0x53`. Currently the +non-supported *ioctl()'s* are: + + CDROMREADMODE1, CDROMREADMODE2, CDROMREADAUDIO, CDROMREADRAW, + CDROMREADCOOKED, CDROMSEEK, CDROMPLAYBLK and CDROMREADALL + +.. [#f2] + + Is there software around that actually uses these? I'd be interested! + +.. 
_cdrom_capabilities: + +CD-ROM capabilities +------------------- + +Instead of just implementing some *ioctl* calls, the interface in +`cdrom.c` supplies the possibility to indicate the **capabilities** +of a CD-ROM drive. This can be done by ORing any number of +capability-constants that are defined in `cdrom.h` at the registration +phase. Currently, the capabilities are any of:: + + CDC_CLOSE_TRAY /* can close tray by software control */ + CDC_OPEN_TRAY /* can open tray */ + CDC_LOCK /* can lock and unlock the door */ + CDC_SELECT_SPEED /* can select speed, in units of ~150 kB/s */ + CDC_SELECT_DISC /* drive is juke-box */ + CDC_MULTI_SESSION /* can read sessions > 1 */ + CDC_MCN /* can read Media Catalog Number */ + CDC_MEDIA_CHANGED /* can report if disc has changed */ + CDC_PLAY_AUDIO /* can perform audio-functions (play, pause, etc) */ + CDC_RESET /* hard reset device */ + CDC_IOCTLS /* driver has non-standard ioctls */ + CDC_DRIVE_STATUS /* driver implements drive status */ + +The capability flag is declared *const*, to prevent drivers from +accidentally tampering with the contents. The capability flags actually +inform `cdrom.c` of what the driver can do. If the drive found +by the driver does not have the capability, it can be masked out by +the *cdrom_device_info* variable *mask*. For instance, the SCSI CD-ROM +driver has implemented the code for loading and ejecting CD-ROM's, and +hence its corresponding flags in *capability* will be set. But a SCSI +CD-ROM drive might be a caddy system, which can't load the tray, and +hence for this drive the *cdrom_device_info* struct will have set +the *CDC_CLOSE_TRAY* bit in *mask*. + +In the file `cdrom.c` you will encounter many constructions of the type:: + + if (cdo->capability & ~cdi->mask & CDC_⟨capability⟩) ... + +There is no *ioctl* to set the mask... The reason is that +I think it is better to control the **behavior** rather than the +**capabilities**. 
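The capability test quoted above can be tried out in user space with a minimal sketch. The `TOY_CDC_*` names and their bit values are illustrative only (the real constants are defined in `cdrom.h`), and `toy_can` is a hypothetical helper, not a function found in `cdrom.c`.

```c
#include <assert.h>

/* Illustrative bit values only; the real constants live in cdrom.h. */
#define TOY_CDC_CLOSE_TRAY 0x1
#define TOY_CDC_OPEN_TRAY  0x2
#define TOY_CDC_LOCK       0x4

/*
 * Nonzero if the driver offers the capability 'flag' and the
 * individual drive has not masked it out.
 */
static int toy_can(int capability, int mask, int flag)
{
    return (capability & ~mask & flag) != 0;
}
```

For the caddy drive from the text, the driver's capability word would contain the close-tray bit while the drive's mask sets the same bit, so `toy_can()` reports that closing the tray is unavailable although opening and locking still are.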
+ +Options +------- + +A final flag register controls the **behavior** of the CD-ROM +drives, in order to satisfy different users' wishes, hopefully +independently of the ideas of the respective author who happened to +have made the drive's support available to the Linux community. The +current behavior options are:: + + CDO_AUTO_CLOSE /* try to close tray upon device open() */ + CDO_AUTO_EJECT /* try to open tray on last device close() */ + CDO_USE_FFLAGS /* use file_pointer->f_flags to indicate purpose for open() */ + CDO_LOCK /* try to lock door if device is opened */ + CDO_CHECK_TYPE /* ensure disc type is data if opened for data */ + +The initial value of this register is +`CDO_AUTO_CLOSE | CDO_USE_FFLAGS | CDO_LOCK`, reflecting my own view on user +interface and software standards. Before you protest, there are two +new *ioctl()'s* implemented in `cdrom.c`, that allow you to control the +behavior by software. These are:: + + CDROM_SET_OPTIONS /* set options specified in (int)arg */ + CDROM_CLEAR_OPTIONS /* clear options specified in (int)arg */ + +One option needs some more explanation: *CDO_USE_FFLAGS*. In the next +section we explain what the need for this option is. + +A software package `setcd`, available from the Debian distribution +and `sunsite.unc.edu`, allows user level control of these flags. + + +The need to know the purpose of opening the CD-ROM device +========================================================= + +Traditionally, Unix devices can be used in two different `modes`, +either by reading/writing to the device file, or by issuing +controlling commands to the device, by the device's *ioctl()* +call. The problem with CD-ROM drives, is that they can be used for +two entirely different purposes. One is to mount removable +file systems, CD-ROM's, the other is to play audio CD's. Audio commands +are implemented entirely through *ioctl()\'s*, presumably because the +first implementation (SUN?) has been such. 
In principle there is +nothing wrong with this, but a good control of the `CD player` demands +that the device can **always** be opened in order to give the +*ioctl* commands, regardless of the state the drive is in. + +On the other hand, when used as a removable-media disc drive (what the +original purpose of CD-ROM s is) we would like to make sure that the +disc drive is ready for operation upon opening the device. In the old +scheme, some CD-ROM drivers don't do any integrity checking, resulting +in a number of i/o errors reported by the VFS to the kernel when an +attempt for mounting a CD-ROM on an empty drive occurs. This is not a +particularly elegant way to find out that there is no CD-ROM inserted; +it more-or-less looks like the old IBM-PC trying to read an empty floppy +drive for a couple of seconds, after which the system complains it +can't read from it. Nowadays we can **sense** the existence of a +removable medium in a drive, and we believe we should exploit that +fact. An integrity check on opening of the device, that verifies the +availability of a CD-ROM and its correct type (data), would be +desirable. + +These two ways of using a CD-ROM drive, principally for data and +secondarily for playing audio discs, have different demands for the +behavior of the *open()* call. Audio use simply wants to open the +device in order to get a file handle which is needed for issuing +*ioctl* commands, while data use wants to open for correct and +reliable data transfer. The only way user programs can indicate what +their *purpose* of opening the device is, is through the *flags* +parameter (see `open(2)`). For CD-ROM devices, these flags aren't +implemented (some drivers implement checking for write-related flags, +but this is not strictly necessary if the device file has correct +permission flags). Most option flags simply don't make sense to +CD-ROM devices: *O_CREAT*, *O_NOCTTY*, *O_TRUNC*, *O_APPEND*, and +*O_SYNC* have no meaning to a CD-ROM. 
+ +We therefore propose to use the flag *O_NONBLOCK* to indicate +that the device is opened just for issuing *ioctl* +commands. Strictly, the meaning of *O_NONBLOCK* is that opening and +subsequent calls to the device don't cause the calling process to +wait. We could interpret this as don't wait until someone has +inserted some valid data-CD-ROM. Thus, our proposal of the +implementation for the *open()* call for CD-ROM s is: + +- If no other flags are set than *O_RDONLY*, the device is opened + for data transfer, and the return value will be 0 only upon successful + initialization of the transfer. The call may even induce some actions + on the CD-ROM, such as closing the tray. +- If the option flag *O_NONBLOCK* is set, opening will always be + successful, unless the whole device doesn't exist. The drive will take + no actions whatsoever. + +And what about standards? +------------------------- + +You might hesitate to accept this proposal as it comes from the +Linux community, and not from some standardizing institute. What +about SUN, SGI, HP and all those other Unix and hardware vendors? +Well, these companies are in the lucky position that they generally +control both the hardware and software of their supported products, +and are large enough to set their own standard. They do not have to +deal with a dozen or more different, competing hardware +configurations\ [#f3]_. + +.. [#f3] + + Incidentally, I think that SUN's approach to mounting CD-ROM s is very + good in origin: under Solaris a volume-daemon automatically mounts a + newly inserted CD-ROM under `/cdrom/**`. + + In my opinion they should have pushed this + further and have **every** CD-ROM on the local area network be + mounted at the similar location, i. e., no matter in which particular + machine you insert a CD-ROM, it will always appear at the same + position in the directory tree, on every system. 
When I wanted to + implement such a user-program for Linux, I came across the + differences in behavior of the various drivers, and the need for an + *ioctl* informing about media changes. + +We believe that using *O_NONBLOCK* to indicate that a device is being opened +for *ioctl* commands only can be easily introduced in the Linux +community. All the CD-player authors will have to be informed, we can +even send in our own patches to the programs. The use of *O_NONBLOCK* +has most likely no influence on the behavior of the CD-players on +other operating systems than Linux. Finally, a user can always revert +to old behavior by a call to +*ioctl(file_descriptor, CDROM_CLEAR_OPTIONS, CDO_USE_FFLAGS)*. + +The preferred strategy of *open()* +---------------------------------- + +The routines in `cdrom.c` are designed in such a way that run-time +configuration of the behavior of CD-ROM devices (of **any** type) +can be carried out, by the *CDROM_SET/CLEAR_OPTIONS* *ioctls*. Thus, various +modes of operation can be set: + +`CDO_AUTO_CLOSE | CDO_USE_FFLAGS | CDO_LOCK` + This is the default setting. (With *CDO_CHECK_TYPE* it will be better, in + the future.) If the device is not yet opened by any other process, and if + the device is being opened for data (*O_NONBLOCK* is not set) and the + tray is found to be open, an attempt to close the tray is made. Then, + it is verified that a disc is in the drive and, if *CDO_CHECK_TYPE* is + set, that it contains tracks of type `data mode 1`. Only if all tests + are passed is the return value zero. The door is locked to prevent file + system corruption. If the drive is opened for audio (*O_NONBLOCK* is + set), no actions are taken and a value of 0 will be returned. + +`CDO_AUTO_CLOSE | CDO_AUTO_EJECT | CDO_LOCK` + This mimics the behavior of the current sbpcd-driver. The option flags are + ignored, the tray is closed on the first open, if necessary. Similarly, + the tray is opened on the last release, i. 
e., if a CD-ROM is unmounted, + it is automatically ejected, such that the user can replace it. + +We hope that these options can convince everybody (both driver +maintainers and user program developers) to adopt the new CD-ROM +driver scheme and option flag interpretation. + +Description of routines in `cdrom.c` +==================================== + +Only a few routines in `cdrom.c` are exported to the drivers. In this +section we will discuss these, as well as the functions that `take +over` the CD-ROM interface to the kernel. The header file belonging +to `cdrom.c` is called `cdrom.h`. Formerly, some of the contents of this +file were placed in the file `ucdrom.h`, but this file has now been +merged back into `cdrom.h`. + +:: + + struct file_operations cdrom_fops + +The contents of this structure were described in cdrom_api_. +A pointer to this structure is assigned to the *fops* field +of the *struct gendisk*. + +:: + + int register_cdrom(struct cdrom_device_info *cdi) + +In about the same way as one registers *cdrom_fops* +with the kernel, the device operations and information structures, +as described in cdrom_api_, should be registered with the +Uniform CD-ROM Driver:: + + register_cdrom(&_info); + + +This function returns zero upon success, and non-zero upon +failure. The structure *_info* should have a pointer to the +driver's *_dops*, as in:: + + struct cdrom_device_info _info = { + _dops; + ... + } + +Note that a driver must have one static structure, *_dops*, while +it may have as many structures *_info* as there are minor devices +active. *Register_cdrom()* builds a linked list from these. + + +:: + + void unregister_cdrom(struct cdrom_device_info *cdi) + +Unregistering device *cdi* with minor number *MINOR(cdi->dev)* removes +the minor device from the list. If it was the last registered minor for +the low-level driver, this disconnects the registered device-operation +routines from the CD-ROM interface. 
This function returns zero upon +success, and non-zero upon failure. + +:: + + int cdrom_open(struct inode * ip, struct file * fp) + +This function is not called directly by the low-level drivers, it is +listed in the standard *cdrom_fops*. If the VFS opens a file, this +function becomes active. A strategy is implemented in this routine, +taking care of all capabilities and options that are set in the +*cdrom_device_ops* connected to the device. Then, the program flow is +transferred to the device-dependent *open()* call. + +:: + + void cdrom_release(struct inode *ip, struct file *fp) + +This function implements the reverse-logic of *cdrom_open()*, and then +calls the device-dependent *release()* routine. When the use-count has +reached 0, the allocated buffers are flushed by calls to *sync_dev(dev)* +and *invalidate_buffers(dev)*. + + +.. _cdrom_ioctl: + +:: + + int cdrom_ioctl(struct inode *ip, struct file *fp, + unsigned int cmd, unsigned long arg) + +This function handles all the standard *ioctl* requests for CD-ROM +devices in a uniform way. The different calls fall into three +categories: *ioctl()'s* that can be directly implemented by device +operations, ones that are routed through the call *audio_ioctl()*, and +the remaining ones, that are presumably device-dependent. Generally, a +negative return value indicates an error. + +Directly implemented *ioctl()'s* +-------------------------------- + +The following `old` CD-ROM *ioctl()*\ 's are implemented by directly +calling device-operations in *cdrom_device_ops*, if implemented and +not masked: + +`CDROMMULTISESSION` + Requests the last session on a CD-ROM. +`CDROMEJECT` + Open tray. +`CDROMCLOSETRAY` + Close tray. +`CDROMEJECT_SW` + If *arg != 0*, set behavior to auto-close (close + tray on first open) and auto-eject (eject on last release), otherwise + set behavior to non-moving on *open()* and *release()* calls. +`CDROM_GET_MCN` + Get the Media Catalog Number from a CD. 
+ +*Ioctl*s routed through *audio_ioctl()* +--------------------------------------- + +The following set of *ioctl()'s* are all implemented through a call to +the *cdrom_device_ops* function *audio_ioctl()*. Memory checks and +allocation are performed in *cdrom_ioctl()*, and also sanitization of +address format (*CDROM_LBA*/*CDROM_MSF*) is done. + +`CDROMSUBCHNL` + Get sub-channel data in argument *arg* of type + `struct cdrom_subchnl *`. +`CDROMREADTOCHDR` + Read Table of Contents header, in *arg* of type + `struct cdrom_tochdr *`. +`CDROMREADTOCENTRY` + Read a Table of Contents entry in *arg* and specified by *arg* + of type `struct cdrom_tocentry *`. +`CDROMPLAYMSF` + Play audio fragment specified in Minute, Second, Frame format, + delimited by *arg* of type `struct cdrom_msf *`. +`CDROMPLAYTRKIND` + Play audio fragment in track-index format delimited by *arg* + of type `struct cdrom_ti *`. +`CDROMVOLCTRL` + Set volume specified by *arg* of type `struct cdrom_volctrl *`. +`CDROMVOLREAD` + Read volume into *arg* of type `struct cdrom_volctrl *`. +`CDROMSTART` + Spin up disc. +`CDROMSTOP` + Stop playback of audio fragment. +`CDROMPAUSE` + Pause playback of audio fragment. +`CDROMRESUME` + Resume playing. + +New *ioctl()'s* in `cdrom.c` +---------------------------- + +The following *ioctl()'s* have been introduced to allow user programs to +control the behavior of individual CD-ROM devices. New *ioctl* +commands can be identified by the underscores in their names. + +`CDROM_SET_OPTIONS` + Set options specified by *arg*. Returns the option flag register + after modification. Use *arg = 0* for reading the current flags. +`CDROM_CLEAR_OPTIONS` + Clear options specified by *arg*. Returns the option flag register + after modification. +`CDROM_SELECT_SPEED` + Select head-rate speed of disc specified by *arg* in units + of standard cdrom speed (176 kB/sec raw data or + 150kB/sec file system data). The value 0 means `auto-select`, + i. 
e., play audio discs at real time and data discs at maximum speed. + The value *arg* is checked against the maximum head rate of the + drive found in the *cdrom_dops*. +`CDROM_SELECT_DISC` + Select disc numbered *arg* from a juke-box. + + First disc is numbered 0. The number *arg* is checked against the + maximum number of discs in the juke-box found in the *cdrom_dops*. +`CDROM_MEDIA_CHANGED` + Returns 1 if a disc has been changed since the last call. + Note that calls to *cdrom_media_changed* by the VFS are treated + by an independent queue, so both mechanisms will detect a + media change once. For juke-boxes, an extra argument *arg* + specifies the slot for which the information is given. The special + value *CDSL_CURRENT* requests that information about the currently + selected slot be returned. +`CDROM_DRIVE_STATUS` + Returns the status of the drive by a call to + *drive_status()*. Return values are defined in cdrom_drive_status_. + Note that this call doesn't return information on the + current playing activity of the drive; this can be polled through + an *ioctl* call to *CDROMSUBCHNL*. For juke-boxes, an extra argument + *arg* specifies the slot for which (possibly limited) information is + given. The special value *CDSL_CURRENT* requests that information + about the currently selected slot be returned. +`CDROM_DISC_STATUS` + Returns the type of the disc currently in the drive. + It should be viewed as a complement to *CDROM_DRIVE_STATUS*. + This *ioctl* can provide *some* information about the current + disc that is inserted in the drive. This functionality used to be + implemented in the low level drivers, but is now carried out + entirely in the Uniform CD-ROM Driver. + + The history of development of the CD's use as a carrier medium for + various digital information has led to many different disc types. + This *ioctl* is useful only in the case that CDs have *only + one* type of data on them. 
While this is often the case, it is + also very common for CDs to have some tracks with data, and some + tracks with audio. Because this is an existing interface, rather + than fixing this interface by changing the assumptions it was made + under, thereby breaking all user applications that use this + function, the Uniform CD-ROM Driver implements this *ioctl* as + follows: If the CD in question has audio tracks on it, and it has + absolutely no CD-I, XA, or data tracks on it, it will be reported + as *CDS_AUDIO*. If it has both audio and data tracks, it will + return *CDS_MIXED*. If there are no audio tracks on the disc, and + if the CD in question has any CD-I tracks on it, it will be + reported as *CDS_XA_2_2*. Failing that, if the CD in question + has any XA tracks on it, it will be reported as *CDS_XA_2_1*. + Finally, if the CD in question has any data tracks on it, + it will be reported as a data CD (*CDS_DATA_1*). + + This *ioctl* can return:: + + CDS_NO_INFO /* no information available */ + CDS_NO_DISC /* no disc is inserted, or tray is opened */ + CDS_AUDIO /* Audio disc (2352 audio bytes/frame) */ + CDS_DATA_1 /* data disc, mode 1 (2048 user bytes/frame) */ + CDS_XA_2_1 /* mixed data (XA), mode 2, form 1 (2048 user bytes) */ + CDS_XA_2_2 /* mixed data (XA), mode 2, form 2 (2324 user bytes) */ + CDS_MIXED /* mixed audio/data disc */ + + For some information concerning frame layout of the various disc + types, see a recent version of `cdrom.h`. + +`CDROM_CHANGER_NSLOTS` + Returns the number of slots in a juke-box. +`CDROMRESET` + Reset the drive. +`CDROM_GET_CAPABILITY` + Returns the *capability* flags for the drive. Refer to section + cdrom_capabilities_ for more information on these flags. +`CDROM_LOCKDOOR` + Locks the door of the drive. `arg == 0` unlocks the door, + any other value locks it. +`CDROM_DEBUG` + Turns on debugging info. Only root is allowed to do this. + Same semantics as CDROM_LOCKDOOR. 
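The order in which *CDROM_DISC_STATUS* tests the track types, as described above, can be condensed into a short C sketch. This is only an illustration of the documented decision order, not the code in `cdrom.c`: the structure `track_counts`, the enum `disc_type` and the function `classify_disc()` are hypothetical names introduced here, and the enum values are stand-ins rather than the real `CDS_*` constants from `cdrom.h`.

```c
/* Hypothetical per-disc track tally; not a kernel structure. */
struct track_counts {
	int audio;
	int data;
	int cdi;
	int xa;
};

/* Illustrative stand-ins for the CDS_* return values. */
enum disc_type {
	DISC_NO_INFO,
	DISC_AUDIO,
	DISC_DATA_1,
	DISC_XA_2_1,
	DISC_XA_2_2,
	DISC_MIXED
};

static enum disc_type classify_disc(const struct track_counts *t)
{
	if (t->audio > 0) {
		/* Pure audio only if there is no CD-I, XA or data track at all. */
		if (t->data == 0 && t->cdi == 0 && t->xa == 0)
			return DISC_AUDIO;
		/* Assumption: any other combination with audio counts as mixed. */
		return DISC_MIXED;
	}
	if (t->cdi > 0)
		return DISC_XA_2_2;
	if (t->xa > 0)
		return DISC_XA_2_1;
	if (t->data > 0)
		return DISC_DATA_1;
	/* Assumption: nothing recognizable was found on the disc. */
	return DISC_NO_INFO;
}
```

With this ordering, a disc holding ten audio tracks and a single data track is classified as mixed, while a disc holding both CD-I and plain data tracks (and no audio) is reported by its CD-I content.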
+ + +Device dependent *ioctl()'s* +---------------------------- + +Finally, all other *ioctl()'s* are passed to the function *dev_ioctl()*, +if implemented. No memory allocation or verification is carried out. + +How to update your driver +========================= + +- Make a backup of your current driver. +- Get hold of the files `cdrom.c` and `cdrom.h`, they should be in + the directory tree that came with this documentation. +- Make sure you include `cdrom.h`. +- Change the 3rd argument of *register_blkdev* from `&_fops` + to `&cdrom_fops`. +- Just after that line, add the following to register with the Uniform + CD-ROM Driver:: + + register_cdrom(&_info); + + Similarly, add a call to *unregister_cdrom()* at the appropriate place. +- Copy an example of the device-operations *struct* to your + source, e. g., from `cm206.c` *cm206_dops*, and change all + entries to names corresponding to your driver, or names you just + happen to like. If your driver doesn't support a certain function, + make the entry *NULL*. At the entry *capability* you should list all + capabilities your driver currently supports. If your driver + has a capability that is not listed, please send me a message. +- Copy the *cdrom_device_info* declaration from the same example + driver, and modify the entries according to your needs. If your + driver dynamically determines the capabilities of the hardware, this + structure should also be declared dynamically. +- Implement all functions in your `_dops` structure, + according to prototypes listed in `cdrom.h`, and specifications given + in cdrom_api_. Most likely you have already implemented + the code in a large part, and you will almost certainly need to adapt the + prototype and return values. +- Rename your `_ioctl()` function to *audio_ioctl* and + change the prototype a little. Remove entries listed in the first + part in cdrom_ioctl_, if your code was OK, these are + just calls to the routines you adapted in the previous step. 
+- You may remove all remaining memory checking code in the + *audio_ioctl()* function that deals with audio commands (these are + listed in the second part of cdrom_ioctl_). There is no + need for memory allocation either, so most *case*s in the *switch* + statement look similar to:: + + case CDROMREADTOCENTRY: + get_toc_entry((struct cdrom_tocentry *) arg); + +- All remaining *ioctl* cases must be moved to a separate + function, *_ioctl*, the device-dependent *ioctl()'s*. Note that + memory checking and allocation must be kept in this code! +- Change the prototypes of *_open()* and + *_release()*, and remove any strategic code (i. e., tray + movement, door locking, etc.). +- Try to recompile the drivers. We advise you to use modules, both + for `cdrom.o` and your driver, as debugging is much easier this + way. + +Thanks +====== + +Thanks to all the people involved. First, Erik Andersen, who has +taken over the torch in maintaining `cdrom.c` and integrating much +CD-ROM-related code in the 2.1-kernel. Thanks to Scott Snyder and +Gerd Knorr, who were the first to implement this interface for SCSI +and IDE-CD drivers and added many ideas for extension of the data +structures relative to kernel 2.0. Further thanks to Heiko Eißfeldt, +Thomas Quinot, Jon Tombs, Ken Pizzini, Eberhard Mönkeberg and Andrew Kroll, +the Linux CD-ROM device driver developers who were kind +enough to give suggestions and criticisms during the writing. Finally +of course, I want to thank Linus Torvalds for making this possible in +the first place. diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c index 933268b8d6a5..5d1e0a4a7d84 100644 --- a/drivers/cdrom/cdrom.c +++ b/drivers/cdrom/cdrom.c @@ -7,7 +7,7 @@ License. See linux/COPYING for more information. Uniform CD-ROM driver for Linux. - See Documentation/cdrom/cdrom-standard.tex for usage information. + See Documentation/cdrom/cdrom-standard.txt for usage information. 
The routines in the file provide a uniform interface between the software that uses CD-ROMs and the various low-level drivers that -- cgit v1.2.3 From 8ea618899b6b4fbe97c8462e7d769867307de011 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 12 Jun 2019 14:52:40 -0300 Subject: docs: cdrom: convert docs to ReST and rename to *.rst The stuff there is almost already at ReST format. A conversion for them is trivial: just add a missing titles and fix some scape codes for them to match ReST syntax. While here, rename the cdrom-standard.txt, with was converted from LaTeX to ReST on the previous patch, and add it to the index file. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/cdrom/cdrom-standard.rst | 1063 ++++++++++++++++++++++++++++++++ Documentation/cdrom/cdrom-standard.txt | 1063 -------------------------------- Documentation/cdrom/ide-cd | 534 ---------------- Documentation/cdrom/ide-cd.rst | 538 ++++++++++++++++ Documentation/cdrom/index.rst | 19 + Documentation/cdrom/packet-writing.rst | 139 +++++ Documentation/cdrom/packet-writing.txt | 132 ---- MAINTAINERS | 2 +- drivers/block/Kconfig | 2 +- drivers/cdrom/cdrom.c | 2 +- drivers/ide/ide-cd.c | 2 +- 11 files changed, 1763 insertions(+), 1733 deletions(-) create mode 100644 Documentation/cdrom/cdrom-standard.rst delete mode 100644 Documentation/cdrom/cdrom-standard.txt delete mode 100644 Documentation/cdrom/ide-cd create mode 100644 Documentation/cdrom/ide-cd.rst create mode 100644 Documentation/cdrom/index.rst create mode 100644 Documentation/cdrom/packet-writing.rst delete mode 100644 Documentation/cdrom/packet-writing.txt (limited to 'drivers') diff --git a/Documentation/cdrom/cdrom-standard.rst b/Documentation/cdrom/cdrom-standard.rst new file mode 100644 index 000000000000..dde4f7f7fdbf --- /dev/null +++ 
b/Documentation/cdrom/cdrom-standard.rst @@ -0,0 +1,1063 @@ +======================= +A Linux CD-ROM standard +======================= + +:Author: David van Leeuwen +:Date: 12 March 1999 +:Updated by: Erik Andersen (andersee@debian.org) +:Updated by: Jens Axboe (axboe@image.dk) + + +Introduction +============ + +Linux is probably the Unix-like operating system that supports +the widest variety of hardware devices. The reasons for this are +presumably + +- The large list of hardware devices available for the many platforms + that Linux now supports (i.e., i386-PCs, Sparc Suns, etc.) +- The open design of the operating system, such that anybody can write a + driver for Linux. +- There is plenty of source code around as examples of how to write a driver. + +The openness of Linux, and the many different types of available +hardware has allowed Linux to support many different hardware devices. +Unfortunately, the very openness that has allowed Linux to support +all these different devices has also allowed the behavior of each +device driver to differ significantly from one device to another. +This divergence of behavior has been very significant for CD-ROM +devices; the way a particular drive reacts to a `standard` *ioctl()* +call varies greatly from one device driver to another. To avoid making +their drivers totally inconsistent, the writers of Linux CD-ROM +drivers generally created new device drivers by understanding, copying, +and then changing an existing one. Unfortunately, this practice did not +maintain uniform behavior across all the Linux CD-ROM drivers. + +This document describes an effort to establish Uniform behavior across +all the different CD-ROM device drivers for Linux. This document also +defines the various *ioctl()'s*, and how the low-level CD-ROM device +drivers should implement them. 
Currently (as of the Linux 2.1.\ *x* +development kernels) several low-level CD-ROM device drivers, including +both IDE/ATAPI and SCSI, now use this Uniform interface. + +When the CD-ROM was developed, the interface between the CD-ROM drive +and the computer was not specified in the standards. As a result, many +different CD-ROM interfaces were developed. Some of them had their +own proprietary design (Sony, Mitsumi, Panasonic, Philips), other +manufacturers adopted an existing electrical interface and changed +the functionality (CreativeLabs/SoundBlaster, Teac, Funai) or simply +adapted their drives to one or more of the already existing electrical +interfaces (Aztech, Sanyo, Funai, Vertos, Longshine, Optics Storage and +most of the `NoName` manufacturers). In cases where a new drive really +brought its own interface or used its own command set and flow control +scheme, either a separate driver had to be written, or an existing +driver had to be enhanced. History has delivered us CD-ROM support for +many of these different interfaces. Nowadays, almost all new CD-ROM +drives are either IDE/ATAPI or SCSI, and it is very unlikely that any +manufacturer will create a new interface. Even finding drives for the +old proprietary interfaces is getting difficult. + +When (in the 1.3.70's) I looked at the existing software interface, +which was expressed through `cdrom.h`, it appeared to be a rather wild +set of commands and data formats [#f1]_. It seemed that many +features of the software interface had been added to accommodate the +capabilities of a particular drive, in an *ad hoc* manner. More +importantly, it appeared that the behavior of the `standard` commands +was different for most of the different drivers: e. g., some drivers +close the tray if an *open()* call occurs when the tray is open, while +others do not. Some drivers lock the door upon opening the device, to +prevent an incoherent file system, but others don't, to allow software +ejection. 
Undoubtedly, the capabilities of the different drives vary, +but even when two drives have the same capability their drivers' +behavior was usually different. + +.. [#f1] + I cannot recollect what kernel version I looked at, then, + presumably 1.2.13 and 1.3.34 --- the latest kernel that I was + indirectly involved in. + +I decided to start a discussion on how to make all the Linux CD-ROM +drivers behave more uniformly. I began by contacting the developers of +the many CD-ROM drivers found in the Linux kernel. Their reactions +encouraged me to write the Uniform CD-ROM Driver which this document is +intended to describe. The implementation of the Uniform CD-ROM Driver is +in the file `cdrom.c`. This driver is intended to be an additional software +layer that sits on top of the low-level device drivers for each CD-ROM drive. +By adding this additional layer, it is possible to have all the different +CD-ROM devices behave **exactly** the same (insofar as the underlying +hardware will allow). + +The goal of the Uniform CD-ROM Driver is **not** to alienate driver developers +who have not yet taken steps to support this effort. The goal of the Uniform CD-ROM +Driver is simply to give people writing application programs for CD-ROM drives +**one** Linux CD-ROM interface with consistent behavior for all +CD-ROM devices. In addition, this also provides a consistent interface +between the low-level device driver code and the Linux kernel. Care +is taken that 100% compatibility exists with the data structures and +programmer's interface defined in `cdrom.h`. This guide was written to +help CD-ROM driver developers adapt their code to use the Uniform CD-ROM +Driver code defined in `cdrom.c`. + +Personally, I think that the most important hardware interfaces are +the IDE/ATAPI drives and, of course, the SCSI drives, but as prices +of hardware drop continuously, it is also likely that people may have +more than one CD-ROM drive, possibly of mixed types. 
It is important +that these drives behave in the same way. In December 1994, one of the +cheapest CD-ROM drives was a Philips cm206, a double-speed proprietary +drive. In the months that I was busy writing a Linux driver for it, +proprietary drives became obsolete and IDE/ATAPI drives became the +standard. At the time of the last update to this document (November +1997) it is becoming difficult to even **find** anything less than a +16 speed CD-ROM drive, and 24 speed drives are common. + +.. _cdrom_api: + +Standardizing through another software level +============================================ + +At the time this document was conceived, all drivers directly +implemented the CD-ROM *ioctl()* calls through their own routines. This +led to the danger of different drivers forgetting to do important things +like checking that the user was giving the driver valid data. More +importantly, this led to the divergence of behavior, which has already +been discussed. + +For this reason, the Uniform CD-ROM Driver was created to enforce consistent +CD-ROM drive behavior, and to provide a common set of services to the various +low-level CD-ROM device drivers. The Uniform CD-ROM Driver now provides another +software-level, that separates the *ioctl()* and *open()* implementation +from the actual hardware implementation. Note that this effort has +made few changes which will affect a user's application programs. The +greatest change involved moving the contents of the various low-level +CD-ROM drivers\' header files to the kernel's cdrom directory. This was +done to help ensure that the user is presented with only one cdrom +interface, the interface defined in `cdrom.h`. + +CD-ROM drives are specific enough (i. e., different from other +block-devices such as floppy or hard disc drives), to define a set +of common **CD-ROM device operations**, *_dops*. +These operations are different from the classical block-device file +operations, *_fops*. 
+ +The routines for the Uniform CD-ROM Driver interface level are implemented +in the file `cdrom.c`. In this file, the Uniform CD-ROM Driver interfaces +with the kernel as a block device by registering the following general +*struct file_operations*:: + + struct file_operations cdrom_fops = { + NULL, /* lseek */ + block_read, /* read - general block-dev read */ + block_write, /* write - general block-dev write */ + NULL, /* readdir */ + NULL, /* select */ + cdrom_ioctl, /* ioctl */ + NULL, /* mmap */ + cdrom_open, /* open */ + cdrom_release, /* release */ + NULL, /* fsync */ + NULL, /* fasync */ + cdrom_media_changed, /* media change */ + NULL /* revalidate */ + }; + +Every active CD-ROM device shares this *struct*. The routines +declared above are all implemented in `cdrom.c`, since this file is the +place where the behavior of all CD-ROM-devices is defined and +standardized. The actual interface to the various types of CD-ROM +hardware is still performed by various low-level CD-ROM-device +drivers. These routines simply implement certain **capabilities** +that are common to all CD-ROM (and really, all removable-media) +devices. + +Registration of a low-level CD-ROM device driver is now done through +the general routines in `cdrom.c`, not through the Virtual File System +(VFS) any more. The interface implemented in `cdrom.c` is carried out +through two general structures that contain information about the +capabilities of the driver, and the specific drives on which the +driver operates. The structures are: + +cdrom_device_ops + This structure contains information about the low-level driver for a + CD-ROM device. This structure is conceptually connected to the major + number of the device (although some drivers may have different + major numbers, as is the case for the IDE driver). + +cdrom_device_info + This structure contains information about a particular CD-ROM drive, + such as its device name, speed, etc. 
This structure is conceptually + connected to the minor number of the device. + +Registering a particular CD-ROM drive with the Uniform CD-ROM Driver +is done by the low-level device driver through a call to:: + + register_cdrom(struct cdrom_device_info * _info) + +The device information structure, *_info*, contains all the +information needed for the kernel to interface with the low-level +CD-ROM device driver. One of the most important entries in this +structure is a pointer to the *cdrom_device_ops* structure of the +low-level driver. + +The device operations structure, *cdrom_device_ops*, contains a list +of pointers to the functions which are implemented in the low-level +device driver. When `cdrom.c` accesses a CD-ROM device, it does it +through the functions in this structure. It is impossible to know all +the capabilities of future CD-ROM drives, so it is expected that this +list may need to be expanded from time to time as new technologies are +developed. For example, CD-R and CD-R/W drives are beginning to become +popular, and support will soon need to be added for them. 
For now, the +current *struct* is:: + + struct cdrom_device_ops { + int (*open)(struct cdrom_device_info *, int); + void (*release)(struct cdrom_device_info *); + int (*drive_status)(struct cdrom_device_info *, int); + unsigned int (*check_events)(struct cdrom_device_info *, + unsigned int, int); + int (*media_changed)(struct cdrom_device_info *, int); + int (*tray_move)(struct cdrom_device_info *, int); + int (*lock_door)(struct cdrom_device_info *, int); + int (*select_speed)(struct cdrom_device_info *, int); + int (*select_disc)(struct cdrom_device_info *, int); + int (*get_last_session) (struct cdrom_device_info *, + struct cdrom_multisession *); + int (*get_mcn)(struct cdrom_device_info *, struct cdrom_mcn *); + int (*reset)(struct cdrom_device_info *); + int (*audio_ioctl)(struct cdrom_device_info *, + unsigned int, void *); + const int capability; /* capability flags */ + int (*generic_packet)(struct cdrom_device_info *, + struct packet_command *); + }; + +When a low-level device driver implements one of these capabilities, +it should add a function pointer to this *struct*. When a particular +function is not implemented, however, this *struct* should contain a +NULL instead. The *capability* flags specify the capabilities of the +CD-ROM hardware and/or low-level CD-ROM driver when a CD-ROM drive +is registered with the Uniform CD-ROM Driver. + +Note that most functions have fewer parameters than their +*blkdev_fops* counterparts. This is because very little of the +information in the structures *inode* and *file* is used. For most +drivers, the main parameter is the *struct* *cdrom_device_info*, from +which the major and minor number can be extracted. (Most low-level +CD-ROM drivers don't even look at the major and minor number though, +since many of them only support one device.) This will be available +through *dev* in *cdrom_device_info* described below. 
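The convention that an unimplemented operation is left as NULL means the generic layer has to test each pointer before calling through it. The following user-space sketch shows that pattern with simplified stand-in structures. Everything prefixed with `demo_` or `toy_` is hypothetical and exists only for this illustration; the error value is likewise an assumption, and the real `cdrom.c` logic is more involved.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (hypothetical names). */
struct demo_device_info;

struct demo_device_ops {
	int (*open)(struct demo_device_info *, int);
	int (*tray_move)(struct demo_device_info *, int);	/* NULL if unsupported */
	int (*lock_door)(struct demo_device_info *, int);	/* NULL if unsupported */
};

struct demo_device_info {
	const struct demo_device_ops *ops;
	int tray_open;		/* toy hardware state: 1 if the tray is open */
};

#define DEMO_ENOSYS 38		/* assumed "not implemented" error code */

/* Generic layer: dispatch only if the low-level driver supplied a routine. */
static int demo_tray_move(struct demo_device_info *cdi, int position)
{
	if (cdi->ops->tray_move == NULL)
		return -DEMO_ENOSYS;
	return cdi->ops->tray_move(cdi, position);
}

static int demo_lock_door(struct demo_device_info *cdi, int lock)
{
	if (cdi->ops->lock_door == NULL)
		return -DEMO_ENOSYS;
	return cdi->ops->lock_door(cdi, lock);
}

/* A toy low-level driver that can move its tray but cannot lock the door. */
static int toy_open(struct demo_device_info *cdi, int purpose)
{
	(void)cdi;
	(void)purpose;
	return 0;
}

static int toy_tray_move(struct demo_device_info *cdi, int position)
{
	cdi->tray_open = position;	/* 0 closes the tray, 1 opens it */
	return 0;
}

static const struct demo_device_ops toy_dops = {
	.open = toy_open,
	.tray_move = toy_tray_move,
	.lock_door = NULL,
};
```

A device built on `toy_dops` accepts tray movement requests, while a door-lock request fails with a negative value instead of dereferencing a NULL pointer.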
+ +The drive-specific, minor-like information that is registered with +`cdrom.c`, currently contains the following fields:: + + struct cdrom_device_info { + const struct cdrom_device_ops * ops; /* device operations for this major */ + struct list_head list; /* linked list of all device_info */ + struct gendisk * disk; /* matching block layer disk */ + void * handle; /* driver-dependent data */ + + int mask; /* mask of capability: disables them */ + int speed; /* maximum speed for reading data */ + int capacity; /* number of discs in a jukebox */ + + unsigned int options:30; /* options flags */ + unsigned mc_flags:2; /* media-change buffer flags */ + unsigned int vfs_events; /* cached events for vfs path */ + unsigned int ioctl_events; /* cached events for ioctl path */ + int use_count; /* number of times device is opened */ + char name[20]; /* name of the device type */ + + __u8 sanyo_slot : 2; /* Sanyo 3-CD changer support */ + __u8 keeplocked : 1; /* CDROM_LOCKDOOR status */ + __u8 reserved : 5; /* not used yet */ + int cdda_method; /* see CDDA_* flags */ + __u8 last_sense; /* saves last sense key */ + __u8 media_written; /* dirty flag, DVD+RW bookkeeping */ + unsigned short mmc3_profile; /* current MMC3 profile */ + int for_data; /* unknown:TBD */ + int (*exit)(struct cdrom_device_info *);/* unknown:TBD */ + int mrw_mode_page; /* which MRW mode page is in use */ + }; + +Using this *struct*, a linked list of the registered minor devices is +built, using the *next* field. The device number, the device operations +struct and specifications of properties of the drive are stored in this +structure. + +The *mask* flags can be used to mask out some of the capabilities listed +in *ops->capability*, if a specific drive doesn't support a feature +of the driver. The value *speed* specifies the maximum head-rate of the +drive, measured in units of normal audio speed (176kB/sec raw data or +150kB/sec file system data). 
The parameters are declared *const* +because they describe properties of the drive, which don't change after +registration. + +A few registers contain variables local to the CD-ROM drive. The +flags *options* are used to specify how the general CD-ROM routines +should behave. These various flags registers should provide enough +flexibility to adapt to the different users' wishes (and **not** the +`arbitrary` wishes of the author of the low-level device driver, as is +the case in the old scheme). The register *mc_flags* is used to buffer +the information from *media_changed()* to two separate queues. Other +data that is specific to a minor drive, can be accessed through *handle*, +which can point to a data structure specific to the low-level driver. +The fields *use_count*, *next*, *options* and *mc_flags* need not be +initialized. + +The intermediate software layer that `cdrom.c` forms will perform some +additional bookkeeping. The use count of the device (the number of +processes that have the device opened) is registered in *use_count*. The +function *cdrom_ioctl()* will verify the appropriate user-memory regions +for read and write, and in case a location on the CD is transferred, +it will `sanitize` the format by making requests to the low-level +drivers in a standard format, and translating all formats between the +user-software and low level drivers. This relieves much of the drivers' +memory checking and format checking and translation. Also, the necessary +structures will be declared on the program stack. + +The implementation of the functions should be as defined in the +following sections. Two functions **must** be implemented, namely +*open()* and *release()*. Other functions may be omitted, their +corresponding capability flags will be cleared upon registration. +Generally, a function returns zero on success and negative on error. 
A +function call should return only after the command has completed, but of +course waiting for the device should not use processor time. + +:: + + int open(struct cdrom_device_info *cdi, int purpose) + +*Open()* should try to open the device for a specific *purpose*, which +can be either: + +- Open for reading data, as done by `mount()` (2), or the + user commands `dd` or `cat`. +- Open for *ioctl* commands, as done by audio-CD playing programs. + +Notice that any strategic code (closing tray upon *open()*, etc.) is +done by the calling routine in `cdrom.c`, so the low-level routine +should only be concerned with proper initialization, such as spinning +up the disc, etc. + +:: + + void release(struct cdrom_device_info *cdi) + +Device-specific actions should be taken such as spinning down the device. +However, strategic actions such as ejection of the tray, or unlocking +the door, should be left over to the general routine *cdrom_release()*. +This is the only function returning type *void*. + +.. _cdrom_drive_status: + +:: + + int drive_status(struct cdrom_device_info *cdi, int slot_nr) + +The function *drive_status*, if implemented, should provide +information on the status of the drive (not the status of the disc, +which may or may not be in the drive). If the drive is not a changer, +*slot_nr* should be ignored. In `cdrom.h` the possibilities are listed:: + + + CDS_NO_INFO /* no information available */ + CDS_NO_DISC /* no disc is inserted, tray is closed */ + CDS_TRAY_OPEN /* tray is opened */ + CDS_DRIVE_NOT_READY /* something is wrong, tray is moving? */ + CDS_DISC_OK /* a disc is loaded and everything is fine */ + +:: + + int media_changed(struct cdrom_device_info *cdi, int disc_nr) + +This function is very similar to the original function in *struct +file_operations*. It returns 1 if the medium of the device *cdi->dev* +has changed since the last call, and 0 otherwise. 
The parameter +*disc_nr* identifies a specific slot in a juke-box, it should be +ignored for single-disc drives. Note that by `re-routing` this +function through *cdrom_media_changed()*, we can implement separate +queues for the VFS and a new *ioctl()* function that can report device +changes to software (e. g., an auto-mounting daemon). + +:: + + int tray_move(struct cdrom_device_info *cdi, int position) + +This function, if implemented, should control the tray movement. (No +other function should control this.) The parameter *position* controls +the desired direction of movement: + +- 0 Close tray +- 1 Open tray + +This function returns 0 upon success, and a non-zero value upon +error. Note that if the tray is already in the desired position, no +action need be taken, and the return value should be 0. + +:: + + int lock_door(struct cdrom_device_info *cdi, int lock) + +This function (and no other code) controls locking of the door, if the +drive allows this. The value of *lock* controls the desired locking +state: + +- 0 Unlock door, manual opening is allowed +- 1 Lock door, tray cannot be ejected manually + +This function returns 0 upon success, and a non-zero value upon +error. Note that if the door is already in the requested state, no +action need be taken, and the return value should be 0. + +:: + + int select_speed(struct cdrom_device_info *cdi, int speed) + +Some CD-ROM drives are capable of changing their head-speed. There +are several reasons for changing the speed of a CD-ROM drive. Badly +pressed CD-ROM s may benefit from less-than-maximum head rate. Modern +CD-ROM drives can obtain very high head rates (up to *24x* is +common). It has been reported that these drives can make reading +errors at these high speeds, reducing the speed can prevent data loss +in these circumstances. Finally, some of these drives can +make an annoyingly loud noise, which a lower speed may reduce. 
+ +This function specifies the speed at which data is read or audio is +played back. The value of *speed* specifies the head-speed of the +drive, measured in units of standard cdrom speed (176kB/sec raw data +or 150kB/sec file system data). So to request that a CD-ROM drive +operate at 300kB/sec you would call the CDROM_SELECT_SPEED *ioctl* +with *speed=2*. The special value `0` means `auto-selection`, i. e., +maximum data-rate or real-time audio rate. If the drive doesn't have +this `auto-selection` capability, the decision should be made on the +current disc loaded and the return value should be positive. A negative +return value indicates an error. + +:: + + int select_disc(struct cdrom_device_info *cdi, int number) + +If the drive can store multiple discs (a juke-box) this function +will perform disc selection. It should return the number of the +selected disc on success, a negative value on error. Currently, only +the ide-cd driver supports this functionality. + +:: + + int get_last_session(struct cdrom_device_info *cdi, + struct cdrom_multisession *ms_info) + +This function should implement the old corresponding *ioctl()*. For +device *cdi->dev*, the start of the last session of the current disc +should be returned in the pointer argument *ms_info*. Note that +routines in `cdrom.c` have sanitized this argument: its requested +format will **always** be of the type *CDROM_LBA* (linear block +addressing mode), whatever the calling software requested. But +sanitization goes even further: the low-level implementation may +return the requested information in *CDROM_MSF* format if it wishes so +(setting the *ms_info->addr_format* field appropriately, of +course) and the routines in `cdrom.c` will make the transformation if +necessary. The return value is 0 upon success. + +:: + + int get_mcn(struct cdrom_device_info *cdi, + struct cdrom_mcn *mcn) + +Some discs carry a `Media Catalog Number` (MCN), also called +`Universal Product Code` (UPC). 
This number should reflect the number +that is generally found in the bar-code on the product. Unfortunately, +the few discs that carry such a number on the disc don't even use the +same format. The return argument to this function is a pointer to a +pre-declared memory region of type *struct cdrom_mcn*. The MCN is +expected as a 13-character string, terminated by a null-character. + +:: + + int reset(struct cdrom_device_info *cdi) + +This call should perform a hard-reset on the drive (although in +circumstances that a hard-reset is necessary, a drive may very well not +listen to commands anymore). Preferably, control is returned to the +caller only after the drive has finished resetting. If the drive is no +longer listening, it may be wise for the underlying low-level cdrom +driver to time out. + +:: + + int audio_ioctl(struct cdrom_device_info *cdi, + unsigned int cmd, void *arg) + +Some of the CD-ROM-\ *ioctl()*\ 's defined in `cdrom.h` can be +implemented by the routines described above, and hence the function +*cdrom_ioctl* will use those. However, most *ioctl()*\ 's deal with +audio-control. We have decided to leave these to be accessed through a +single function, repeating the arguments *cmd* and *arg*. Note that +the latter is of type *void*, rather than *unsigned long int*. +The routine *cdrom_ioctl()* does do some useful things, +though. It sanitizes the address format type to *CDROM_MSF* (Minutes, +Seconds, Frames) for all audio calls. It also verifies the memory +location of *arg*, and reserves stack-memory for the argument. This +makes implementation of the *audio_ioctl()* much simpler than in the +old driver scheme. For example, you may look up the function +*cm206_audio_ioctl()* `cm206.c` that should be updated with +this documentation. + +An unimplemented ioctl should return *-ENOSYS*, but a harmless request +(e. g., *CDROMSTART*) may be ignored by returning 0 (success). Other +errors should be according to the standards, whatever they are. 
When +an error is returned by the low-level driver, the Uniform CD-ROM Driver +tries whenever possible to return the error code to the calling program. +(We may decide to sanitize the return value in *cdrom_ioctl()* though, in +order to guarantee a uniform interface to the audio-player software.) + +:: + + int dev_ioctl(struct cdrom_device_info *cdi, + unsigned int cmd, unsigned long arg) + +Some *ioctl()'s* seem to be specific to certain CD-ROM drives. That is, +they are introduced to service some capabilities of certain drives. In +fact, there are 6 different *ioctl()'s* for reading data, either in some +particular kind of format, or audio data. Not many drives support +reading audio tracks as data, I believe this is because of protection +of copyrights of artists. Moreover, I think that if audio-tracks are +supported, it should be done through the VFS and not via *ioctl()'s*. A +problem here could be the fact that audio-frames are 2352 bytes long, +so either the audio-file-system should ask for 75264 bytes at once +(the least common multiple of 512 and 2352), or the drivers should +bend their backs to cope with this incoherence (to which I would be +opposed). Furthermore, it is very difficult for the hardware to find +the exact frame boundaries, since there are no synchronization headers +in audio frames. Once these issues are resolved, this code should be +standardized in `cdrom.c`. + +Because there are so many *ioctl()'s* that seem to be introduced to +satisfy certain drivers [#f2]_, any non-standard *ioctl()*\ s +are routed through the call *dev_ioctl()*. In principle, `private` +*ioctl()*\ 's should be numbered after the device's major number, and not +the general CD-ROM *ioctl* number, `0x53`. Currently the +non-supported *ioctl()'s* are: + + CDROMREADMODE1, CDROMREADMODE2, CDROMREADAUDIO, CDROMREADRAW, + CDROMREADCOOKED, CDROMSEEK, CDROMPLAY-BLK and CDROM-READALL + +.. [#f2] + + Is there software around that actually uses these? I'd be interested! + +.. 
_cdrom_capabilities: + +CD-ROM capabilities +------------------- + +Instead of just implementing some *ioctl* calls, the interface in +`cdrom.c` supplies the possibility to indicate the **capabilities** +of a CD-ROM drive. This can be done by ORing any number of +capability-constants that are defined in `cdrom.h` at the registration +phase. Currently, the capabilities are any of:: + + CDC_CLOSE_TRAY /* can close tray by software control */ + CDC_OPEN_TRAY /* can open tray */ + CDC_LOCK /* can lock and unlock the door */ + CDC_SELECT_SPEED /* can select speed, in units of * sim*150 ,kB/s */ + CDC_SELECT_DISC /* drive is juke-box */ + CDC_MULTI_SESSION /* can read sessions *> rm1* */ + CDC_MCN /* can read Media Catalog Number */ + CDC_MEDIA_CHANGED /* can report if disc has changed */ + CDC_PLAY_AUDIO /* can perform audio-functions (play, pause, etc) */ + CDC_RESET /* hard reset device */ + CDC_IOCTLS /* driver has non-standard ioctls */ + CDC_DRIVE_STATUS /* driver implements drive status */ + +The capability flag is declared *const*, to prevent drivers from +accidentally tampering with the contents. The capability fags actually +inform `cdrom.c` of what the driver can do. If the drive found +by the driver does not have the capability, is can be masked out by +the *cdrom_device_info* variable *mask*. For instance, the SCSI CD-ROM +driver has implemented the code for loading and ejecting CD-ROM's, and +hence its corresponding flags in *capability* will be set. But a SCSI +CD-ROM drive might be a caddy system, which can't load the tray, and +hence for this drive the *cdrom_device_info* struct will have set +the *CDC_CLOSE_TRAY* bit in *mask*. + +In the file `cdrom.c` you will encounter many constructions of the type:: + + if (cdo->capability & ∼cdi->mask & CDC _⟨capability⟩) ... + +There is no *ioctl* to set the mask... The reason is that +I think it is better to control the **behavior** rather than the +**capabilities**. 
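
The caddy example above can be sketched in a few lines of C. This is a minimal illustration, not code taken from `cdrom.c`: the struct names `demo_ops` and `demo_info`, the helper `demo_can()`, and the bit values given to the `CDC_` names are assumptions made for the sketch. Only the expression `capability & ~mask & flag` comes from the text.

```c
/* Illustrative bit values only; the real constants are defined in cdrom.h. */
enum {
    CDC_CLOSE_TRAY = 0x1,
    CDC_OPEN_TRAY  = 0x2,
    CDC_LOCK       = 0x4
};

/* Hypothetical stand-in for the per-driver operations structure. */
struct demo_ops {
    const int capability;   /* what the driver code can do */
};

/* Hypothetical stand-in for the per-drive information structure. */
struct demo_info {
    const struct demo_ops *ops;
    int mask;               /* capabilities this particular drive lacks */
};

/* Non-zero only if the driver supports the flag and the drive does not mask it. */
static int demo_can(const struct demo_info *cdi, int flag)
{
    return (cdi->ops->capability & ~cdi->mask & flag) != 0;
}

/* A driver that implements tray loading, ejecting and door locking. */
static const struct demo_ops scsi_like_ops = {
    .capability = CDC_CLOSE_TRAY | CDC_OPEN_TRAY | CDC_LOCK
};

/* A caddy drive served by that driver: it cannot pull the tray in. */
static struct demo_info caddy_drive = {
    .ops  = &scsi_like_ops,
    .mask = CDC_CLOSE_TRAY
};
```

With these definitions, `demo_can(&caddy_drive, CDC_CLOSE_TRAY)` is zero, while the same call with `CDC_OPEN_TRAY` or `CDC_LOCK` is non-zero. The driver's capability word stays untouched and only the per-drive mask differs.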
+ +Options +------- + +A final flag register controls the **behavior** of the CD-ROM +drives, in order to satisfy different users' wishes, hopefully +independently of the ideas of the respective author who happened to +have made the drive's support available to the Linux community. The +current behavior options are:: + + CDO_AUTO_CLOSE /* try to close tray upon device open() */ + CDO_AUTO_EJECT /* try to open tray on last device close() */ + CDO_USE_FFLAGS /* use file_pointer->f_flags to indicate purpose for open() */ + CDO_LOCK /* try to lock door if device is opened */ + CDO_CHECK_TYPE /* ensure disc type is data if opened for data */ + +The initial value of this register is +`CDO_AUTO_CLOSE | CDO_USE_FFLAGS | CDO_LOCK`, reflecting my own view on user +interface and software standards. Before you protest, there are two +new *ioctl()'s* implemented in `cdrom.c`, that allow you to control the +behavior by software. These are:: + + CDROM_SET_OPTIONS /* set options specified in (int)arg */ + CDROM_CLEAR_OPTIONS /* clear options specified in (int)arg */ + +One option needs some more explanation: *CDO_USE_FFLAGS*. In the next +newsection we explain what the need for this option is. + +A software package `setcd`, available from the Debian distribution +and `sunsite.unc.edu`, allows user level control of these flags. + + +The need to know the purpose of opening the CD-ROM device +========================================================= + +Traditionally, Unix devices can be used in two different `modes`, +either by reading/writing to the device file, or by issuing +controlling commands to the device, by the device's *ioctl()* +call. The problem with CD-ROM drives, is that they can be used for +two entirely different purposes. One is to mount removable +file systems, CD-ROM's, the other is to play audio CD's. Audio commands +are implemented entirely through *ioctl()\'s*, presumably because the +first implementation (SUN?) has been such. 
In principle there is +nothing wrong with this, but a good control of the `CD player` demands +that the device can **always** be opened in order to give the +*ioctl* commands, regardless of the state the drive is in. + +On the other hand, when used as a removable-media disc drive (what the +original purpose of CD-ROM s is) we would like to make sure that the +disc drive is ready for operation upon opening the device. In the old +scheme, some CD-ROM drivers don't do any integrity checking, resulting +in a number of i/o errors reported by the VFS to the kernel when an +attempt for mounting a CD-ROM on an empty drive occurs. This is not a +particularly elegant way to find out that there is no CD-ROM inserted; +it more-or-less looks like the old IBM-PC trying to read an empty floppy +drive for a couple of seconds, after which the system complains it +can't read from it. Nowadays we can **sense** the existence of a +removable medium in a drive, and we believe we should exploit that +fact. An integrity check on opening of the device, that verifies the +availability of a CD-ROM and its correct type (data), would be +desirable. + +These two ways of using a CD-ROM drive, principally for data and +secondarily for playing audio discs, have different demands for the +behavior of the *open()* call. Audio use simply wants to open the +device in order to get a file handle which is needed for issuing +*ioctl* commands, while data use wants to open for correct and +reliable data transfer. The only way user programs can indicate what +their *purpose* of opening the device is, is through the *flags* +parameter (see `open(2)`). For CD-ROM devices, these flags aren't +implemented (some drivers implement checking for write-related flags, +but this is not strictly necessary if the device file has correct +permission flags). Most option flags simply don't make sense to +CD-ROM devices: *O_CREAT*, *O_NOCTTY*, *O_TRUNC*, *O_APPEND*, and +*O_SYNC* have no meaning to a CD-ROM. 
+ +We therefore propose to use the flag *O_NONBLOCK* to indicate +that the device is opened just for issuing *ioctl* +commands. Strictly, the meaning of *O_NONBLOCK* is that opening and +subsequent calls to the device don't cause the calling process to +wait. We could interpret this as don't wait until someone has +inserted some valid data-CD-ROM. Thus, our proposal of the +implementation for the *open()* call for CD-ROM s is: + +- If no other flags are set than *O_RDONLY*, the device is opened + for data transfer, and the return value will be 0 only upon successful + initialization of the transfer. The call may even induce some actions + on the CD-ROM, such as closing the tray. +- If the option flag *O_NONBLOCK* is set, opening will always be + successful, unless the whole device doesn't exist. The drive will take + no actions whatsoever. + +And what about standards? +------------------------- + +You might hesitate to accept this proposal as it comes from the +Linux community, and not from some standardizing institute. What +about SUN, SGI, HP and all those other Unix and hardware vendors? +Well, these companies are in the lucky position that they generally +control both the hardware and software of their supported products, +and are large enough to set their own standard. They do not have to +deal with a dozen or more different, competing hardware +configurations\ [#f3]_. + +.. [#f3] + + Incidentally, I think that SUN's approach to mounting CD-ROM s is very + good in origin: under Solaris a volume-daemon automatically mounts a + newly inserted CD-ROM under `/cdrom/**`. + + In my opinion they should have pushed this + further and have **every** CD-ROM on the local area network be + mounted at the similar location, i. e., no matter in which particular + machine you insert a CD-ROM, it will always appear at the same + position in the directory tree, on every system. 
When I wanted to + implement such a user-program for Linux, I came across the + differences in behavior of the various drivers, and the need for an + *ioctl* informing about media changes. + +We believe that using *O_NONBLOCK* to indicate that a device is being opened +for *ioctl* commands only can be easily introduced in the Linux +community. All the CD-player authors will have to be informed, we can +even send in our own patches to the programs. The use of *O_NONBLOCK* +has most likely no influence on the behavior of the CD-players on +other operating systems than Linux. Finally, a user can always revert +to old behavior by a call to +*ioctl(file_descriptor, CDROM_CLEAR_OPTIONS, CDO_USE_FFLAGS)*. + +The preferred strategy of *open()* +---------------------------------- + +The routines in `cdrom.c` are designed in such a way that run-time +configuration of the behavior of CD-ROM devices (of **any** type) +can be carried out, by the *CDROM_SET/CLEAR_OPTIONS* *ioctls*. Thus, various +modes of operation can be set: + +`CDO_AUTO_CLOSE | CDO_USE_FFLAGS | CDO_LOCK` + This is the default setting. (With *CDO_CHECK_TYPE* it will be better, in + the future.) If the device is not yet opened by any other process, and if + the device is being opened for data (*O_NONBLOCK* is not set) and the + tray is found to be open, an attempt to close the tray is made. Then, + it is verified that a disc is in the drive and, if *CDO_CHECK_TYPE* is + set, that it contains tracks of type `data mode 1`. Only if all tests + are passed is the return value zero. The door is locked to prevent file + system corruption. If the drive is opened for audio (*O_NONBLOCK* is + set), no actions are taken and a value of 0 will be returned. + +`CDO_AUTO_CLOSE | CDO_AUTO_EJECT | CDO_LOCK` + This mimics the behavior of the current sbpcd-driver. The option flags are + ignored, the tray is closed on the first open, if necessary. Similarly, + the tray is opened on the last release, i. 
e., if a CD-ROM is unmounted, + it is automatically ejected, such that the user can replace it. + +We hope that these option can convince everybody (both driver +maintainers and user program developers) to adopt the new CD-ROM +driver scheme and option flag interpretation. + +Description of routines in `cdrom.c` +==================================== + +Only a few routines in `cdrom.c` are exported to the drivers. In this +new section we will discuss these, as well as the functions that `take +over' the CD-ROM interface to the kernel. The header file belonging +to `cdrom.c` is called `cdrom.h`. Formerly, some of the contents of this +file were placed in the file `ucdrom.h`, but this file has now been +merged back into `cdrom.h`. + +:: + + struct file_operations cdrom_fops + +The contents of this structure were described in cdrom_api_. +A pointer to this structure is assigned to the *fops* field +of the *struct gendisk*. + +:: + + int register_cdrom(struct cdrom_device_info *cdi) + +This function is used in about the same way one registers *cdrom_fops* +with the kernel, the device operations and information structures, +as described in cdrom_api_, should be registered with the +Uniform CD-ROM Driver:: + + register_cdrom(&_info); + + +This function returns zero upon success, and non-zero upon +failure. The structure *_info* should have a pointer to the +driver's *_dops*, as in:: + + struct cdrom_device_info _info = { + _dops; + ... + } + +Note that a driver must have one static structure, *_dops*, while +it may have as many structures *_info* as there are minor devices +active. *Register_cdrom()* builds a linked list from these. + + +:: + + void unregister_cdrom(struct cdrom_device_info *cdi) + +Unregistering device *cdi* with minor number *MINOR(cdi->dev)* removes +the minor device from the list. If it was the last registered minor for +the low-level driver, this disconnects the registered device-operation +routines from the CD-ROM interface. 
This function returns zero upon +success, and non-zero upon failure. + +:: + + int cdrom_open(struct inode * ip, struct file * fp) + +This function is not called directly by the low-level drivers, it is +listed in the standard *cdrom_fops*. If the VFS opens a file, this +function becomes active. A strategy is implemented in this routine, +taking care of all capabilities and options that are set in the +*cdrom_device_ops* connected to the device. Then, the program flow is +transferred to the device_dependent *open()* call. + +:: + + void cdrom_release(struct inode *ip, struct file *fp) + +This function implements the reverse-logic of *cdrom_open()*, and then +calls the device-dependent *release()* routine. When the use-count has +reached 0, the allocated buffers are flushed by calls to *sync_dev(dev)* +and *invalidate_buffers(dev)*. + + +.. _cdrom_ioctl: + +:: + + int cdrom_ioctl(struct inode *ip, struct file *fp, + unsigned int cmd, unsigned long arg) + +This function handles all the standard *ioctl* requests for CD-ROM +devices in a uniform way. The different calls fall into three +categories: *ioctl()'s* that can be directly implemented by device +operations, ones that are routed through the call *audio_ioctl()*, and +the remaining ones, that are presumable device-dependent. Generally, a +negative return value indicates an error. + +Directly implemented *ioctl()'s* +-------------------------------- + +The following `old` CD-ROM *ioctl()*\ 's are implemented by directly +calling device-operations in *cdrom_device_ops*, if implemented and +not masked: + +`CDROMMULTISESSION` + Requests the last session on a CD-ROM. +`CDROMEJECT` + Open tray. +`CDROMCLOSETRAY` + Close tray. +`CDROMEJECT_SW` + If *arg\not=0*, set behavior to auto-close (close + tray on first open) and auto-eject (eject on last release), otherwise + set behavior to non-moving on *open()* and *release()* calls. +`CDROM_GET_MCN` + Get the Media Catalog Number from a CD. 
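
The condition `if implemented and not masked` can be made concrete with a small model of the dispatch for the two tray commands. This is a sketch under stated assumptions, not the code in `cdrom.c`: every `demo_` and `DEMO_` name, the command numbers and the capability bit values are invented for illustration. Returning *-ENOSYS* for an unavailable operation is also an assumption, borrowed from the convention this document gives for unimplemented ioctls.

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative values; the real command and capability constants are in cdrom.h. */
enum { DEMO_CDROMEJECT = 1, DEMO_CDROMCLOSETRAY = 2 };
enum { DEMO_CDC_CLOSE_TRAY = 0x1, DEMO_CDC_OPEN_TRAY = 0x2 };

struct demo_cdi;

/* Hypothetical subset of the device operations. */
struct demo_dops {
    int (*tray_move)(struct demo_cdi *cdi, int position);
    int capability;
};

struct demo_cdi {
    const struct demo_dops *ops;
    int mask;
    int tray_is_open;       /* state kept only for this demonstration */
};

/* Call tray_move() only if it is implemented and the capability is not masked. */
static int demo_direct_ioctl(struct demo_cdi *cdi, unsigned int cmd)
{
    int usable = cdi->ops->capability & ~cdi->mask;

    switch (cmd) {
    case DEMO_CDROMEJECT:
        if (cdi->ops->tray_move == NULL || !(usable & DEMO_CDC_OPEN_TRAY))
            return -ENOSYS;
        return cdi->ops->tray_move(cdi, 1);     /* 1: open tray */
    case DEMO_CDROMCLOSETRAY:
        if (cdi->ops->tray_move == NULL || !(usable & DEMO_CDC_CLOSE_TRAY))
            return -ENOSYS;
        return cdi->ops->tray_move(cdi, 0);     /* 0: close tray */
    default:
        return -ENOSYS;                         /* not a direct command in this model */
    }
}

/* Toy low-level routine: records the requested position and reports success. */
static int demo_tray_move(struct demo_cdi *cdi, int position)
{
    cdi->tray_is_open = position;
    return 0;
}
```

For a drive whose mask contains `DEMO_CDC_CLOSE_TRAY`, the eject command reaches `demo_tray_move()`, while the close command is rejected before the low-level routine is called.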
+ +*Ioctl*s routed through *audio_ioctl()* +--------------------------------------- + +The following set of *ioctl()'s* are all implemented through a call to +the *cdrom_fops* function *audio_ioctl()*. Memory checks and +allocation are performed in *cdrom_ioctl()*, and also sanitization of +address format (*CDROM_LBA*/*CDROM_MSF*) is done. + +`CDROMSUBCHNL` + Get sub-channel data in argument *arg* of type + `struct cdrom_subchnl *`. +`CDROMREADTOCHDR` + Read Table of Contents header, in *arg* of type + `struct cdrom_tochdr *`. +`CDROMREADTOCENTRY` + Read a Table of Contents entry in *arg* and specified by *arg* + of type `struct cdrom_tocentry *`. +`CDROMPLAYMSF` + Play audio fragment specified in Minute, Second, Frame format, + delimited by *arg* of type `struct cdrom_msf *`. +`CDROMPLAYTRKIND` + Play audio fragment in track-index format delimited by *arg* + of type `struct cdrom_ti *`. +`CDROMVOLCTRL` + Set volume specified by *arg* of type `struct cdrom_volctrl *`. +`CDROMVOLREAD` + Read volume into by *arg* of type `struct cdrom_volctrl *`. +`CDROMSTART` + Spin up disc. +`CDROMSTOP` + Stop playback of audio fragment. +`CDROMPAUSE` + Pause playback of audio fragment. +`CDROMRESUME` + Resume playing. + +New *ioctl()'s* in `cdrom.c` +---------------------------- + +The following *ioctl()'s* have been introduced to allow user programs to +control the behavior of individual CD-ROM devices. New *ioctl* +commands can be identified by the underscores in their names. + +`CDROM_SET_OPTIONS` + Set options specified by *arg*. Returns the option flag register + after modification. Use *arg = \rm0* for reading the current flags. +`CDROM_CLEAR_OPTIONS` + Clear options specified by *arg*. Returns the option flag register + after modification. +`CDROM_SELECT_SPEED` + Select head-rate speed of disc specified as by *arg* in units + of standard cdrom speed (176\,kB/sec raw data or + 150kB/sec file system data). The value 0 means `auto-select`, + i. 
e., play audio discs at real time and data discs at maximum speed. + The value *arg* is checked against the maximum head rate of the + drive found in the *cdrom_dops*. +`CDROM_SELECT_DISC` + Select disc numbered *arg* from a juke-box. + + First disc is numbered 0. The number *arg* is checked against the + maximum number of discs in the juke-box found in the *cdrom_dops*. +`CDROM_MEDIA_CHANGED` + Returns 1 if a disc has been changed since the last call. + Note that calls to *cdrom_media_changed* by the VFS are treated + by an independent queue, so both mechanisms will detect a + media change once. For juke-boxes, an extra argument *arg* + specifies the slot for which the information is given. The special + value *CDSL_CURRENT* requests that information about the currently + selected slot be returned. +`CDROM_DRIVE_STATUS` + Returns the status of the drive by a call to + *drive_status()*. Return values are defined in cdrom_drive_status_. + Note that this call doesn't return information on the + current playing activity of the drive; this can be polled through + an *ioctl* call to *CDROMSUBCHNL*. For juke-boxes, an extra argument + *arg* specifies the slot for which (possibly limited) information is + given. The special value *CDSL_CURRENT* requests that information + about the currently selected slot be returned. +`CDROM_DISC_STATUS` + Returns the type of the disc currently in the drive. + It should be viewed as a complement to *CDROM_DRIVE_STATUS*. + This *ioctl* can provide *some* information about the current + disc that is inserted in the drive. This functionality used to be + implemented in the low level drivers, but is now carried out + entirely in Uniform CD-ROM Driver. + + The history of development of the CD's use as a carrier medium for + various digital information has lead to many different disc types. + This *ioctl* is useful only in the case that CDs have \emph {only + one} type of data on them. 
While this is often the case, it is + also very common for CDs to have some tracks with data, and some + tracks with audio. Because this is an existing interface, rather + than fixing this interface by changing the assumptions it was made + under, thereby breaking all user applications that use this + function, the Uniform CD-ROM Driver implements this *ioctl* as + follows: If the CD in question has audio tracks on it, and it has + absolutely no CD-I, XA, or data tracks on it, it will be reported + as *CDS_AUDIO*. If it has both audio and data tracks, it will + return *CDS_MIXED*. If there are no audio tracks on the disc, and + if the CD in question has any CD-I tracks on it, it will be + reported as *CDS_XA_2_2*. Failing that, if the CD in question + has any XA tracks on it, it will be reported as *CDS_XA_2_1*. + Finally, if the CD in question has any data tracks on it, + it will be reported as a data CD (*CDS_DATA_1*). + + This *ioctl* can return:: + + CDS_NO_INFO /* no information available */ + CDS_NO_DISC /* no disc is inserted, or tray is opened */ + CDS_AUDIO /* Audio disc (2352 audio bytes/frame) */ + CDS_DATA_1 /* data disc, mode 1 (2048 user bytes/frame) */ + CDS_XA_2_1 /* mixed data (XA), mode 2, form 1 (2048 user bytes) */ + CDS_XA_2_2 /* mixed data (XA), mode 2, form 1 (2324 user bytes) */ + CDS_MIXED /* mixed audio/data disc */ + + For some information concerning frame layout of the various disc + types, see a recent version of `cdrom.h`. + +`CDROM_CHANGER_NSLOTS` + Returns the number of slots in a juke-box. +`CDROMRESET` + Reset the drive. +`CDROM_GET_CAPABILITY` + Returns the *capability* flags for the drive. Refer to section + cdrom_capabilities_ for more information on these flags. +`CDROM_LOCKDOOR` + Locks the door of the drive. `arg == 0` unlocks the door, + any other value locks it. +`CDROM_DEBUG` + Turns on debugging info. Only root is allowed to do this. + Same semantics as CDROM_LOCKDOOR. 
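
The decision order described for *CDROM_DISC_STATUS* above can be sketched as a single function. This is an illustration of the rules as written, not the implementation in `cdrom.c`: the `demo_` names and the enum values are hypothetical. Two points are assumptions: a disc with audio plus any non-audio track is treated as mixed, and a disc with no recognised tracks falls back to a no-information result.

```c
/* Illustrative result codes; the real CDS_ values are defined in cdrom.h. */
enum demo_disc_status {
    DEMO_CDS_NO_INFO,
    DEMO_CDS_AUDIO,
    DEMO_CDS_DATA_1,
    DEMO_CDS_XA_2_1,
    DEMO_CDS_XA_2_2,
    DEMO_CDS_MIXED
};

/* Hypothetical per-disc track counts, one counter per track type. */
struct demo_tracks {
    int audio;
    int data;
    int cdi;
    int xa;
};

/* Apply the documented rules in order; the first matching rule wins. */
static enum demo_disc_status demo_classify_disc(const struct demo_tracks *t)
{
    if (t->audio > 0) {
        if (t->data == 0 && t->cdi == 0 && t->xa == 0)
            return DEMO_CDS_AUDIO;      /* audio tracks and nothing else */
        return DEMO_CDS_MIXED;          /* assumption: audio plus anything else */
    }
    if (t->cdi > 0)
        return DEMO_CDS_XA_2_2;         /* no audio, at least one CD-I track */
    if (t->xa > 0)
        return DEMO_CDS_XA_2_1;         /* failing that, at least one XA track */
    if (t->data > 0)
        return DEMO_CDS_DATA_1;         /* finally, plain data tracks */
    return DEMO_CDS_NO_INFO;            /* assumption: nothing recognised */
}
```

Because the checks run in a fixed order, a disc that carries CD-I, XA and data tracks together is reported by its CD-I content alone, which is the loss of detail the text warns about.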
+ + +Device dependent *ioctl()'s* +---------------------------- + +Finally, all other *ioctl()'s* are passed to the function *dev_ioctl()*, +if implemented. No memory allocation or verification is carried out. + +How to update your driver +========================= + +- Make a backup of your current driver. +- Get hold of the files `cdrom.c` and `cdrom.h`, they should be in + the directory tree that came with this documentation. +- Make sure you include `cdrom.h`. +- Change the 3rd argument of *register_blkdev* from `&_fops` + to `&cdrom_fops`. +- Just after that line, add the following to register with the Uniform + CD-ROM Driver:: + + register_cdrom(&_info);* + + Similarly, add a call to *unregister_cdrom()* at the appropriate place. +- Copy an example of the device-operations *struct* to your + source, e. g., from `cm206.c` *cm206_dops*, and change all + entries to names corresponding to your driver, or names you just + happen to like. If your driver doesn't support a certain function, + make the entry *NULL*. At the entry *capability* you should list all + capabilities your driver currently supports. If your driver + has a capability that is not listed, please send me a message. +- Copy the *cdrom_device_info* declaration from the same example + driver, and modify the entries according to your needs. If your + driver dynamically determines the capabilities of the hardware, this + structure should also be declared dynamically. +- Implement all functions in your `_dops` structure, + according to prototypes listed in `cdrom.h`, and specifications given + in cdrom_api_. Most likely you have already implemented + the code in a large part, and you will almost certainly need to adapt the + prototype and return values. +- Rename your `_ioctl()` function to *audio_ioctl* and + change the prototype a little. Remove entries listed in the first + part in cdrom_ioctl_, if your code was OK, these are + just calls to the routines you adapted in the previous step. 
+- You may remove all remaining memory checking code in the + *audio_ioctl()* function that deals with audio commands (these are + listed in the second part of cdrom_ioctl_. There is no + need for memory allocation either, so most *case*s in the *switch* + statement look similar to:: + + case CDROMREADTOCENTRY: + get_toc_entry\bigl((struct cdrom_tocentry *) arg); + +- All remaining *ioctl* cases must be moved to a separate + function, *_ioctl*, the device-dependent *ioctl()'s*. Note that + memory checking and allocation must be kept in this code! +- Change the prototypes of *_open()* and + *_release()*, and remove any strategic code (i. e., tray + movement, door locking, etc.). +- Try to recompile the drivers. We advise you to use modules, both + for `cdrom.o` and your driver, as debugging is much easier this + way. + +Thanks +====== + +Thanks to all the people involved. First, Erik Andersen, who has +taken over the torch in maintaining `cdrom.c` and integrating much +CD-ROM-related code in the 2.1-kernel. Thanks to Scott Snyder and +Gerd Knorr, who were the first to implement this interface for SCSI +and IDE-CD drivers and added many ideas for extension of the data +structures relative to kernel~2.0. Further thanks to Heiko Eißfeldt, +Thomas Quinot, Jon Tombs, Ken Pizzini, Eberhard Mönkeberg and Andrew Kroll, +the Linux CD-ROM device driver developers who were kind +enough to give suggestions and criticisms during the writing. Finally +of course, I want to thank Linus Torvalds for making this possible in +the first place. 
diff --git a/Documentation/cdrom/cdrom-standard.txt b/Documentation/cdrom/cdrom-standard.txt deleted file mode 100644 index dde4f7f7fdbf..000000000000 --- a/Documentation/cdrom/cdrom-standard.txt +++ /dev/null @@ -1,1063 +0,0 @@ -======================= -A Linux CD-ROM standard -======================= - -:Author: David van Leeuwen -:Date: 12 March 1999 -:Updated by: Erik Andersen (andersee@debian.org) -:Updated by: Jens Axboe (axboe@image.dk) - - -Introduction -============ - -Linux is probably the Unix-like operating system that supports -the widest variety of hardware devices. The reasons for this are -presumably - -- The large list of hardware devices available for the many platforms - that Linux now supports (i.e., i386-PCs, Sparc Suns, etc.) -- The open design of the operating system, such that anybody can write a - driver for Linux. -- There is plenty of source code around as examples of how to write a driver. - -The openness of Linux, and the many different types of available -hardware has allowed Linux to support many different hardware devices. -Unfortunately, the very openness that has allowed Linux to support -all these different devices has also allowed the behavior of each -device driver to differ significantly from one device to another. -This divergence of behavior has been very significant for CD-ROM -devices; the way a particular drive reacts to a `standard` *ioctl()* -call varies greatly from one device driver to another. To avoid making -their drivers totally inconsistent, the writers of Linux CD-ROM -drivers generally created new device drivers by understanding, copying, -and then changing an existing one. Unfortunately, this practice did not -maintain uniform behavior across all the Linux CD-ROM drivers. - -This document describes an effort to establish Uniform behavior across -all the different CD-ROM device drivers for Linux. 
This document also -defines the various *ioctl()'s*, and how the low-level CD-ROM device -drivers should implement them. Currently (as of the Linux 2.1.\ *x* -development kernels) several low-level CD-ROM device drivers, including -both IDE/ATAPI and SCSI, now use this Uniform interface. - -When the CD-ROM was developed, the interface between the CD-ROM drive -and the computer was not specified in the standards. As a result, many -different CD-ROM interfaces were developed. Some of them had their -own proprietary design (Sony, Mitsumi, Panasonic, Philips), other -manufacturers adopted an existing electrical interface and changed -the functionality (CreativeLabs/SoundBlaster, Teac, Funai) or simply -adapted their drives to one or more of the already existing electrical -interfaces (Aztech, Sanyo, Funai, Vertos, Longshine, Optics Storage and -most of the `NoName` manufacturers). In cases where a new drive really -brought its own interface or used its own command set and flow control -scheme, either a separate driver had to be written, or an existing -driver had to be enhanced. History has delivered us CD-ROM support for -many of these different interfaces. Nowadays, almost all new CD-ROM -drives are either IDE/ATAPI or SCSI, and it is very unlikely that any -manufacturer will create a new interface. Even finding drives for the -old proprietary interfaces is getting difficult. - -When (in the 1.3.70's) I looked at the existing software interface, -which was expressed through `cdrom.h`, it appeared to be a rather wild -set of commands and data formats [#f1]_. It seemed that many -features of the software interface had been added to accommodate the -capabilities of a particular drive, in an *ad hoc* manner. More -importantly, it appeared that the behavior of the `standard` commands -was different for most of the different drivers: e. g., some drivers -close the tray if an *open()* call occurs when the tray is open, while -others do not. 
Some drivers lock the door upon opening the device, to -prevent an incoherent file system, but others don't, to allow software -ejection. Undoubtedly, the capabilities of the different drives vary, -but even when two drives have the same capability their drivers' -behavior was usually different. - -.. [#f1] - I cannot recollect what kernel version I looked at, then, - presumably 1.2.13 and 1.3.34 --- the latest kernel that I was - indirectly involved in. - -I decided to start a discussion on how to make all the Linux CD-ROM -drivers behave more uniformly. I began by contacting the developers of -the many CD-ROM drivers found in the Linux kernel. Their reactions -encouraged me to write the Uniform CD-ROM Driver which this document is -intended to describe. The implementation of the Uniform CD-ROM Driver is -in the file `cdrom.c`. This driver is intended to be an additional software -layer that sits on top of the low-level device drivers for each CD-ROM drive. -By adding this additional layer, it is possible to have all the different -CD-ROM devices behave **exactly** the same (insofar as the underlying -hardware will allow). - -The goal of the Uniform CD-ROM Driver is **not** to alienate driver developers -who have not yet taken steps to support this effort. The goal of the Uniform CD-ROM -Driver is simply to give people writing application programs for CD-ROM drives -**one** Linux CD-ROM interface with consistent behavior for all -CD-ROM devices. In addition, it provides a consistent interface -between the low-level device driver code and the Linux kernel. Care -is taken that 100% compatibility exists with the data structures and -programmer's interface defined in `cdrom.h`. This guide was written to -help CD-ROM driver developers adapt their code to use the Uniform CD-ROM -Driver code defined in `cdrom.c`.
- -Personally, I think that the most important hardware interfaces are -the IDE/ATAPI drives and, of course, the SCSI drives, but as prices -of hardware drop continuously, it is also likely that people may have -more than one CD-ROM drive, possibly of mixed types. It is important -that these drives behave in the same way. In December 1994, one of the -cheapest CD-ROM drives was a Philips cm206, a double-speed proprietary -drive. In the months that I was busy writing a Linux driver for it, -proprietary drives became obsolete and IDE/ATAPI drives became the -standard. At the time of the last update to this document (November -1997) it is becoming difficult to even **find** anything less than a -16 speed CD-ROM drive, and 24 speed drives are common. - -.. _cdrom_api: - -Standardizing through another software level -============================================ - -At the time this document was conceived, all drivers directly -implemented the CD-ROM *ioctl()* calls through their own routines. This -led to the danger of different drivers forgetting to do important things -like checking that the user was giving the driver valid data. More -importantly, this led to the divergence of behavior, which has already -been discussed. - -For this reason, the Uniform CD-ROM Driver was created to enforce consistent -CD-ROM drive behavior, and to provide a common set of services to the various -low-level CD-ROM device drivers. The Uniform CD-ROM Driver now provides another -software-level, that separates the *ioctl()* and *open()* implementation -from the actual hardware implementation. Note that this effort has -made few changes which will affect a user's application programs. The -greatest change involved moving the contents of the various low-level -CD-ROM drivers\' header files to the kernel's cdrom directory. This was -done to help ensure that the user is only presented with only one cdrom -interface, the interface defined in `cdrom.h`. - -CD-ROM drives are specific enough (i. 
e., different from other -block-devices such as floppy or hard disc drives), to define a set -of common **CD-ROM device operations**, *_dops*. -These operations are different from the classical block-device file -operations, *_fops*. - -The routines for the Uniform CD-ROM Driver interface level are implemented -in the file `cdrom.c`. In this file, the Uniform CD-ROM Driver interfaces -with the kernel as a block device by registering the following general -*struct file_operations*:: - - struct file_operations cdrom_fops = { - NULL, /* lseek */ - block_read, /* read: general block-dev read */ - block_write, /* write: general block-dev write */ - NULL, /* readdir */ - NULL, /* select */ - cdrom_ioctl, /* ioctl */ - NULL, /* mmap */ - cdrom_open, /* open */ - cdrom_release, /* release */ - NULL, /* fsync */ - NULL, /* fasync */ - cdrom_media_changed, /* media change */ - NULL /* revalidate */ - }; - -Every active CD-ROM device shares this *struct*. The routines -declared above are all implemented in `cdrom.c`, since this file is the -place where the behavior of all CD-ROM-devices is defined and -standardized. The actual interface to the various types of CD-ROM -hardware is still performed by various low-level CD-ROM-device -drivers. These routines simply implement certain **capabilities** -that are common to all CD-ROM (and really, all removable-media) -devices. - -Registration of a low-level CD-ROM device driver is now done through -the general routines in `cdrom.c`, not through the Virtual File System -(VFS) any more. The interface implemented in `cdrom.c` is carried out -through two general structures that contain information about the -capabilities of the driver, and the specific drives on which the -driver operates. The structures are: - -cdrom_device_ops - This structure contains information about the low-level driver for a - CD-ROM device.
This structure is conceptually connected to the major - number of the device (although some drivers may have different - major numbers, as is the case for the IDE driver). - -cdrom_device_info - This structure contains information about a particular CD-ROM drive, - such as its device name, speed, etc. This structure is conceptually - connected to the minor number of the device. - -Registering a particular CD-ROM drive with the Uniform CD-ROM Driver -is done by the low-level device driver though a call to:: - - register_cdrom(struct cdrom_device_info * _info) - -The device information structure, *_info*, contains all the -information needed for the kernel to interface with the low-level -CD-ROM device driver. One of the most important entries in this -structure is a pointer to the *cdrom_device_ops* structure of the -low-level driver. - -The device operations structure, *cdrom_device_ops*, contains a list -of pointers to the functions which are implemented in the low-level -device driver. When `cdrom.c` accesses a CD-ROM device, it does it -through the functions in this structure. It is impossible to know all -the capabilities of future CD-ROM drives, so it is expected that this -list may need to be expanded from time to time as new technologies are -developed. For example, CD-R and CD-R/W drives are beginning to become -popular, and support will soon need to be added for them. 
For now, the -current *struct* is:: - - struct cdrom_device_ops { - int (*open)(struct cdrom_device_info *, int); - void (*release)(struct cdrom_device_info *); - int (*drive_status)(struct cdrom_device_info *, int); - unsigned int (*check_events)(struct cdrom_device_info *, - unsigned int, int); - int (*media_changed)(struct cdrom_device_info *, int); - int (*tray_move)(struct cdrom_device_info *, int); - int (*lock_door)(struct cdrom_device_info *, int); - int (*select_speed)(struct cdrom_device_info *, int); - int (*select_disc)(struct cdrom_device_info *, int); - int (*get_last_session)(struct cdrom_device_info *, - struct cdrom_multisession *); - int (*get_mcn)(struct cdrom_device_info *, struct cdrom_mcn *); - int (*reset)(struct cdrom_device_info *); - int (*audio_ioctl)(struct cdrom_device_info *, - unsigned int, void *); - const int capability; /* capability flags */ - int (*generic_packet)(struct cdrom_device_info *, - struct packet_command *); - }; - -When a low-level device driver implements one of these capabilities, -it should add a function pointer to this *struct*. When a particular -function is not implemented, however, this *struct* should contain a -NULL instead. The *capability* flags specify the capabilities of the -CD-ROM hardware and/or low-level CD-ROM driver when a CD-ROM drive -is registered with the Uniform CD-ROM Driver. - -Note that most functions have fewer parameters than their -*blkdev_fops* counterparts. This is because very little of the -information in the structures *inode* and *file* is used. For most -drivers, the main parameter is the *struct* *cdrom_device_info*, from -which the major and minor number can be extracted. (Most low-level -CD-ROM drivers don't even look at the major and minor number though, -since many of them only support one device.) This will be available -through *dev* in *cdrom_device_info* described below.
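The dispatch rule described above (call through the table when the driver filled in a pointer, treat a NULL entry as "not implemented") can be sketched in user-space C. This is only an illustration under stated assumptions: the `demo_*` and `generic_*` names are hypothetical, the structures are cut-down stand-ins rather than the kernel's own, and returning `-ENOSYS` for a missing operation is a choice made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the structures described above.
 * The real definitions live in the kernel's cdrom.h. */
struct demo_device_info;

struct demo_device_ops {
	int  (*open)(struct demo_device_info *, int);
	void (*release)(struct demo_device_info *);
	int  (*tray_move)(struct demo_device_info *, int);
	int  (*lock_door)(struct demo_device_info *, int);
};

struct demo_device_info {
	const struct demo_device_ops *ops;
	int tray_open;	/* mock drive state: 1 = tray open, 0 = closed */
};

static int demo_open(struct demo_device_info *cdi, int purpose)
{
	(void)cdi;
	(void)purpose;
	return 0;
}

static void demo_release(struct demo_device_info *cdi)
{
	(void)cdi;
}

static int demo_tray_move(struct demo_device_info *cdi, int position)
{
	cdi->tray_open = position;	/* 0 closes the tray, 1 opens it */
	return 0;
}

/* This mock driver cannot lock its door, so lock_door stays NULL. */
static const struct demo_device_ops demo_dops = {
	.open      = demo_open,
	.release   = demo_release,
	.tray_move = demo_tray_move,
	.lock_door = NULL,
};

/* Generic-layer helpers: dispatch through the table when the driver
 * supplied a function, otherwise report the operation as unsupported.
 * Returning -ENOSYS here is an assumption of this sketch. */
static int generic_tray_move(struct demo_device_info *cdi, int position)
{
	assert(cdi != NULL && cdi->ops != NULL);
	if (cdi->ops->tray_move == NULL)
		return -ENOSYS;
	return cdi->ops->tray_move(cdi, position);
}

static int generic_lock_door(struct demo_device_info *cdi, int lock)
{
	assert(cdi != NULL && cdi->ops != NULL);
	if (cdi->ops->lock_door == NULL)
		return -ENOSYS;
	return cdi->ops->lock_door(cdi, lock);
}
```

Keeping the NULL check in the generic layer means a low-level driver only has to fill in the operations its hardware supports.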
- -The drive-specific, minor-like information that is registered with -`cdrom.c`, currently contains the following fields:: - - struct cdrom_device_info { - const struct cdrom_device_ops * ops; /* device operations for this major */ - struct list_head list; /* linked list of all device_info */ - struct gendisk * disk; /* matching block layer disk */ - void * handle; /* driver-dependent data */ - - int mask; /* mask of capability: disables them */ - int speed; /* maximum speed for reading data */ - int capacity; /* number of discs in a jukebox */ - - unsigned int options:30; /* options flags */ - unsigned mc_flags:2; /* media-change buffer flags */ - unsigned int vfs_events; /* cached events for vfs path */ - unsigned int ioctl_events; /* cached events for ioctl path */ - int use_count; /* number of times device is opened */ - char name[20]; /* name of the device type */ - - __u8 sanyo_slot : 2; /* Sanyo 3-CD changer support */ - __u8 keeplocked : 1; /* CDROM_LOCKDOOR status */ - __u8 reserved : 5; /* not used yet */ - int cdda_method; /* see CDDA_* flags */ - __u8 last_sense; /* saves last sense key */ - __u8 media_written; /* dirty flag, DVD+RW bookkeeping */ - unsigned short mmc3_profile; /* current MMC3 profile */ - int for_data; /* unknown:TBD */ - int (*exit)(struct cdrom_device_info *);/* unknown:TBD */ - int mrw_mode_page; /* which MRW mode page is in use */ - }; - -Using this *struct*, a linked list of the registered minor devices is -built, using the *next* field. The device number, the device operations -struct and specifications of properties of the drive are stored in this -structure. - -The *mask* flags can be used to mask out some of the capabilities listed -in *ops->capability*, if a specific drive doesn't support a feature -of the driver. The value *speed* specifies the maximum head-rate of the -drive, measured in units of normal audio speed (176kB/sec raw data or -150kB/sec file system data). 
The parameters are declared *const* -because they describe properties of the drive, which don't change after -registration. - -A few registers contain variables local to the CD-ROM drive. The -flags *options* are used to specify how the general CD-ROM routines -should behave. These various flags registers should provide enough -flexibility to adapt to the different users' wishes (and **not** the -`arbitrary` wishes of the author of the low-level device driver, as is -the case in the old scheme). The register *mc_flags* is used to buffer -the information from *media_changed()* to two separate queues. Other -data that is specific to a minor drive, can be accessed through *handle*, -which can point to a data structure specific to the low-level driver. -The fields *use_count*, *next*, *options* and *mc_flags* need not be -initialized. - -The intermediate software layer that `cdrom.c` forms will perform some -additional bookkeeping. The use count of the device (the number of -processes that have the device opened) is registered in *use_count*. The -function *cdrom_ioctl()* will verify the appropriate user-memory regions -for read and write, and in case a location on the CD is transferred, -it will `sanitize` the format by making requests to the low-level -drivers in a standard format, and translating all formats between the -user-software and low level drivers. This relieves much of the drivers' -memory checking and format checking and translation. Also, the necessary -structures will be declared on the program stack. - -The implementation of the functions should be as defined in the -following sections. Two functions **must** be implemented, namely -*open()* and *release()*. Other functions may be omitted, their -corresponding capability flags will be cleared upon registration. -Generally, a function returns zero on success and negative on error. 
A -function call should return only after the command has completed, but of -course waiting for the device should not use processor time. - -:: - - int open(struct cdrom_device_info *cdi, int purpose) - -*Open()* should try to open the device for a specific *purpose*, which -can be either: - -- Open for reading data, as done by `mount(2)`, or the - user commands `dd` or `cat`. -- Open for *ioctl* commands, as done by audio-CD playing programs. - -Notice that any strategic code (closing tray upon *open()*, etc.) is -done by the calling routine in `cdrom.c`, so the low-level routine -should only be concerned with proper initialization, such as spinning -up the disc, etc. - -:: - - void release(struct cdrom_device_info *cdi) - -Device-specific actions should be taken such as spinning down the device. -However, strategic actions such as ejection of the tray, or unlocking -the door, should be left to the general routine *cdrom_release()*. -This is the only function returning type *void*. - -.. _cdrom_drive_status: - -:: - - int drive_status(struct cdrom_device_info *cdi, int slot_nr) - -The function *drive_status*, if implemented, should provide -information on the status of the drive (not the status of the disc, -which may or may not be in the drive). If the drive is not a changer, -*slot_nr* should be ignored. In `cdrom.h` the possibilities are listed:: - - - CDS_NO_INFO /* no information available */ - CDS_NO_DISC /* no disc is inserted, tray is closed */ - CDS_TRAY_OPEN /* tray is opened */ - CDS_DRIVE_NOT_READY /* something is wrong, tray is moving? */ - CDS_DISC_OK /* a disc is loaded and everything is fine */ - -:: - - int media_changed(struct cdrom_device_info *cdi, int disc_nr) - -This function is very similar to the original function in *struct -file_operations*. It returns 1 if the medium of the device *cdi->dev* -has changed since the last call, and 0 otherwise.
The parameter -*disc_nr* identifies a specific slot in a juke-box, it should be -ignored for single-disc drives. Note that by `re-routing` this -function through *cdrom_media_changed()*, we can implement separate -queues for the VFS and a new *ioctl()* function that can report device -changes to software (e. g., an auto-mounting daemon). - -:: - - int tray_move(struct cdrom_device_info *cdi, int position) - -This function, if implemented, should control the tray movement. (No -other function should control this.) The parameter *position* controls -the desired direction of movement: - -- 0 Close tray -- 1 Open tray - -This function returns 0 upon success, and a non-zero value upon -error. Note that if the tray is already in the desired position, no -action need be taken, and the return value should be 0. - -:: - - int lock_door(struct cdrom_device_info *cdi, int lock) - -This function (and no other code) controls locking of the door, if the -drive allows this. The value of *lock* controls the desired locking -state: - -- 0 Unlock door, manual opening is allowed -- 1 Lock door, tray cannot be ejected manually - -This function returns 0 upon success, and a non-zero value upon -error. Note that if the door is already in the requested state, no -action need be taken, and the return value should be 0. - -:: - - int select_speed(struct cdrom_device_info *cdi, int speed) - -Some CD-ROM drives are capable of changing their head-speed. There -are several reasons for changing the speed of a CD-ROM drive. Badly -pressed CD-ROM s may benefit from less-than-maximum head rate. Modern -CD-ROM drives can obtain very high head rates (up to *24x* is -common). It has been reported that these drives can make reading -errors at these high speeds, reducing the speed can prevent data loss -in these circumstances. Finally, some of these drives can -make an annoyingly loud noise, which a lower speed may reduce. 
- -This function specifies the speed at which data is read or audio is -played back. The value of *speed* specifies the head-speed of the -drive, measured in units of standard cdrom speed (176kB/sec raw data -or 150kB/sec file system data). So to request that a CD-ROM drive -operate at 300kB/sec you would call the CDROM_SELECT_SPEED *ioctl* -with *speed=2*. The special value `0` means `auto-selection`, i. e., -maximum data-rate or real-time audio rate. If the drive doesn't have -this `auto-selection` capability, the decision should be made on the -current disc loaded and the return value should be positive. A negative -return value indicates an error. - -:: - - int select_disc(struct cdrom_device_info *cdi, int number) - -If the drive can store multiple discs (a juke-box) this function -will perform disc selection. It should return the number of the -selected disc on success, a negative value on error. Currently, only -the ide-cd driver supports this functionality. - -:: - - int get_last_session(struct cdrom_device_info *cdi, - struct cdrom_multisession *ms_info) - -This function should implement the old corresponding *ioctl()*. For -device *cdi->dev*, the start of the last session of the current disc -should be returned in the pointer argument *ms_info*. Note that -routines in `cdrom.c` have sanitized this argument: its requested -format will **always** be of the type *CDROM_LBA* (linear block -addressing mode), whatever the calling software requested. But -sanitization goes even further: the low-level implementation may -return the requested information in *CDROM_MSF* format if it wishes so -(setting the *ms_info->addr_format* field appropriately, of -course) and the routines in `cdrom.c` will make the transformation if -necessary. The return value is 0 upon success. - -:: - - int get_mcn(struct cdrom_device_info *cdi, - struct cdrom_mcn *mcn) - -Some discs carry a `Media Catalog Number` (MCN), also called -`Universal Product Code` (UPC). 
This number should reflect the number -that is generally found in the bar-code on the product. Unfortunately, -the few discs that carry such a number on the disc don't even use the -same format. The return argument to this function is a pointer to a -pre-declared memory region of type *struct cdrom_mcn*. The MCN is -expected as a 13-character string, terminated by a null-character. - -:: - - int reset(struct cdrom_device_info *cdi) - -This call should perform a hard-reset on the drive (although in -circumstances that a hard-reset is necessary, a drive may very well not -listen to commands anymore). Preferably, control is returned to the -caller only after the drive has finished resetting. If the drive is no -longer listening, it may be wise for the underlying low-level cdrom -driver to time out. - -:: - - int audio_ioctl(struct cdrom_device_info *cdi, - unsigned int cmd, void *arg) - -Some of the CD-ROM-\ *ioctl()*\ 's defined in `cdrom.h` can be -implemented by the routines described above, and hence the function -*cdrom_ioctl* will use those. However, most *ioctl()*\ 's deal with -audio-control. We have decided to leave these to be accessed through a -single function, repeating the arguments *cmd* and *arg*. Note that -the latter is of type *void*, rather than *unsigned long int*. -The routine *cdrom_ioctl()* does do some useful things, -though. It sanitizes the address format type to *CDROM_MSF* (Minutes, -Seconds, Frames) for all audio calls. It also verifies the memory -location of *arg*, and reserves stack-memory for the argument. This -makes implementation of the *audio_ioctl()* much simpler than in the -old driver scheme. For example, you may look up the function -*cm206_audio_ioctl()* `cm206.c` that should be updated with -this documentation. - -An unimplemented ioctl should return *-ENOSYS*, but a harmless request -(e. g., *CDROMSTART*) may be ignored by returning 0 (success). Other -errors should be according to the standards, whatever they are. 
When -an error is returned by the low-level driver, the Uniform CD-ROM Driver -tries whenever possible to return the error code to the calling program. -(We may decide to sanitize the return value in *cdrom_ioctl()* though, in -order to guarantee a uniform interface to the audio-player software.) - -:: - - int dev_ioctl(struct cdrom_device_info *cdi, - unsigned int cmd, unsigned long arg) - -Some *ioctl()'s* seem to be specific to certain CD-ROM drives. That is, -they are introduced to service some capabilities of certain drives. In -fact, there are 6 different *ioctl()'s* for reading data, either in some -particular kind of format, or audio data. Not many drives support -reading audio tracks as data, I believe this is because of protection -of copyrights of artists. Moreover, I think that if audio-tracks are -supported, it should be done through the VFS and not via *ioctl()'s*. A -problem here could be the fact that audio-frames are 2352 bytes long, -so either the audio-file-system should ask for 75264 bytes at once -(the least common multiple of 512 and 2352), or the drivers should -bend their backs to cope with this incoherence (to which I would be -opposed). Furthermore, it is very difficult for the hardware to find -the exact frame boundaries, since there are no synchronization headers -in audio frames. Once these issues are resolved, this code should be -standardized in `cdrom.c`. - -Because there are so many *ioctl()'s* that seem to be introduced to -satisfy certain drivers [#f2]_, any non-standard *ioctl()*\ s -are routed through the call *dev_ioctl()*. In principle, `private` -*ioctl()*\ 's should be numbered after the device's major number, and not -the general CD-ROM *ioctl* number, `0x53`. Currently the -non-supported *ioctl()'s* are: - - CDROMREADMODE1, CDROMREADMODE2, CDROMREADAUDIO, CDROMREADRAW, - CDROMREADCOOKED, CDROMSEEK, CDROMPLAY-BLK and CDROM-READALL - -.. [#f2] - - Is there software around that actually uses these? I'd be interested! - -.. 
_cdrom_capabilities: - -CD-ROM capabilities -------------------- - -Instead of just implementing some *ioctl* calls, the interface in -`cdrom.c` supplies the possibility to indicate the **capabilities** -of a CD-ROM drive. This can be done by ORing any number of -capability-constants that are defined in `cdrom.h` at the registration -phase. Currently, the capabilities are any of:: - - CDC_CLOSE_TRAY /* can close tray by software control */ - CDC_OPEN_TRAY /* can open tray */ - CDC_LOCK /* can lock and unlock the door */ - CDC_SELECT_SPEED /* can select speed, in units of ~150 kB/s */ - CDC_SELECT_DISC /* drive is juke-box */ - CDC_MULTI_SESSION /* can read sessions > 1 */ - CDC_MCN /* can read Media Catalog Number */ - CDC_MEDIA_CHANGED /* can report if disc has changed */ - CDC_PLAY_AUDIO /* can perform audio-functions (play, pause, etc) */ - CDC_RESET /* hard reset device */ - CDC_IOCTLS /* driver has non-standard ioctls */ - CDC_DRIVE_STATUS /* driver implements drive status */ - -The capability flag is declared *const*, to prevent drivers from -accidentally tampering with the contents. The capability flags actually -inform `cdrom.c` of what the driver can do. If the drive found -by the driver does not have the capability, it can be masked out by -the *cdrom_device_info* variable *mask*. For instance, the SCSI CD-ROM -driver has implemented the code for loading and ejecting CD-ROMs, and -hence its corresponding flags in *capability* will be set. But a SCSI -CD-ROM drive might be a caddy system, which can't load the tray, and -hence for this drive the *cdrom_device_info* struct will have set -the *CDC_CLOSE_TRAY* bit in *mask*. - -In the file `cdrom.c` you will encounter many constructions of the type:: - - if (cdo->capability & ~cdi->mask & CDC_<capability>) ... - -There is no *ioctl* to set the mask... The reason is that -I think it is better to control the **behavior** rather than the -**capabilities**.
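The capability test quoted above combines three bit sets: what the driver can do, what this particular drive cannot do, and the capability being asked about. A minimal sketch of that test follows, using the caddy example from the text. The `DEMO_CDC_*` bit values and the helper `demo_can()` are hypothetical and exist only for this illustration; the real `CDC_*` constants are defined in `cdrom.h`.

```c
#include <assert.h>

/* Illustrative bit values only; the real CDC_* constants are
 * defined in cdrom.h and are not reproduced here. */
#define DEMO_CDC_CLOSE_TRAY 0x1
#define DEMO_CDC_OPEN_TRAY  0x2
#define DEMO_CDC_LOCK       0x4

/* Returns 1 when the driver advertises the capability `flag` and the
 * per-drive mask does not disable it, 0 otherwise. */
static int demo_can(int capability, int mask, int flag)
{
	assert(flag != 0);
	return (capability & ~mask & flag) != 0;
}
```

For a caddy drive handled by a driver that advertises all three capabilities, setting `DEMO_CDC_CLOSE_TRAY` in the mask makes `demo_can()` report that the tray cannot be closed by software, while opening and locking remain available.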
- -Options -------- - -A final flag register controls the **behavior** of the CD-ROM -drives, in order to satisfy different users' wishes, hopefully -independently of the ideas of the respective author who happened to -have made the drive's support available to the Linux community. The -current behavior options are:: - - CDO_AUTO_CLOSE /* try to close tray upon device open() */ - CDO_AUTO_EJECT /* try to open tray on last device close() */ - CDO_USE_FFLAGS /* use file_pointer->f_flags to indicate purpose for open() */ - CDO_LOCK /* try to lock door if device is opened */ - CDO_CHECK_TYPE /* ensure disc type is data if opened for data */ - -The initial value of this register is -`CDO_AUTO_CLOSE | CDO_USE_FFLAGS | CDO_LOCK`, reflecting my own view on user -interface and software standards. Before you protest, there are two -new *ioctl()'s* implemented in `cdrom.c`, that allow you to control the -behavior by software. These are:: - - CDROM_SET_OPTIONS /* set options specified in (int)arg */ - CDROM_CLEAR_OPTIONS /* clear options specified in (int)arg */ - -One option needs some more explanation: *CDO_USE_FFLAGS*. In the next -section we explain why this option is needed. - -A software package `setcd`, available from the Debian distribution -and `sunsite.unc.edu`, allows user level control of these flags. - - -The need to know the purpose of opening the CD-ROM device -========================================================= - -Traditionally, Unix devices can be used in two different `modes`, -either by reading/writing to the device file, or by issuing -controlling commands to the device, by the device's *ioctl()* -call. The problem with CD-ROM drives is that they can be used for -two entirely different purposes. One is to mount removable -file systems, CD-ROMs, the other is to play audio CDs. Audio commands -are implemented entirely through *ioctl()'s*, presumably because the -first implementation (SUN?) has been such.
In principle there is -nothing wrong with this, but a good control of the `CD player` demands -that the device can **always** be opened in order to give the -*ioctl* commands, regardless of the state the drive is in. - -On the other hand, when used as a removable-media disc drive (what the -original purpose of CD-ROM s is) we would like to make sure that the -disc drive is ready for operation upon opening the device. In the old -scheme, some CD-ROM drivers don't do any integrity checking, resulting -in a number of i/o errors reported by the VFS to the kernel when an -attempt for mounting a CD-ROM on an empty drive occurs. This is not a -particularly elegant way to find out that there is no CD-ROM inserted; -it more-or-less looks like the old IBM-PC trying to read an empty floppy -drive for a couple of seconds, after which the system complains it -can't read from it. Nowadays we can **sense** the existence of a -removable medium in a drive, and we believe we should exploit that -fact. An integrity check on opening of the device, that verifies the -availability of a CD-ROM and its correct type (data), would be -desirable. - -These two ways of using a CD-ROM drive, principally for data and -secondarily for playing audio discs, have different demands for the -behavior of the *open()* call. Audio use simply wants to open the -device in order to get a file handle which is needed for issuing -*ioctl* commands, while data use wants to open for correct and -reliable data transfer. The only way user programs can indicate what -their *purpose* of opening the device is, is through the *flags* -parameter (see `open(2)`). For CD-ROM devices, these flags aren't -implemented (some drivers implement checking for write-related flags, -but this is not strictly necessary if the device file has correct -permission flags). Most option flags simply don't make sense to -CD-ROM devices: *O_CREAT*, *O_NOCTTY*, *O_TRUNC*, *O_APPEND*, and -*O_SYNC* have no meaning to a CD-ROM. 
- -We therefore propose to use the flag *O_NONBLOCK* to indicate -that the device is opened just for issuing *ioctl* -commands. Strictly, the meaning of *O_NONBLOCK* is that opening and -subsequent calls to the device don't cause the calling process to -wait. We could interpret this as don't wait until someone has -inserted some valid data-CD-ROM. Thus, our proposal of the -implementation for the *open()* call for CD-ROM s is: - -- If no other flags are set than *O_RDONLY*, the device is opened - for data transfer, and the return value will be 0 only upon successful - initialization of the transfer. The call may even induce some actions - on the CD-ROM, such as closing the tray. -- If the option flag *O_NONBLOCK* is set, opening will always be - successful, unless the whole device doesn't exist. The drive will take - no actions whatsoever. - -And what about standards? -------------------------- - -You might hesitate to accept this proposal as it comes from the -Linux community, and not from some standardizing institute. What -about SUN, SGI, HP and all those other Unix and hardware vendors? -Well, these companies are in the lucky position that they generally -control both the hardware and software of their supported products, -and are large enough to set their own standard. They do not have to -deal with a dozen or more different, competing hardware -configurations\ [#f3]_. - -.. [#f3] - - Incidentally, I think that SUN's approach to mounting CD-ROM s is very - good in origin: under Solaris a volume-daemon automatically mounts a - newly inserted CD-ROM under `/cdrom/**`. - - In my opinion they should have pushed this - further and have **every** CD-ROM on the local area network be - mounted at the similar location, i. e., no matter in which particular - machine you insert a CD-ROM, it will always appear at the same - position in the directory tree, on every system. 
When I wanted to - implement such a user-program for Linux, I came across the - differences in behavior of the various drivers, and the need for an - *ioctl* informing about media changes. - -We believe that using *O_NONBLOCK* to indicate that a device is being opened -for *ioctl* commands only can be easily introduced in the Linux -community. All the CD-player authors will have to be informed, we can -even send in our own patches to the programs. The use of *O_NONBLOCK* -has most likely no influence on the behavior of the CD-players on -other operating systems than Linux. Finally, a user can always revert -to old behavior by a call to -*ioctl(file_descriptor, CDROM_CLEAR_OPTIONS, CDO_USE_FFLAGS)*. - -The preferred strategy of *open()* ----------------------------------- - -The routines in `cdrom.c` are designed in such a way that run-time -configuration of the behavior of CD-ROM devices (of **any** type) -can be carried out, by the *CDROM_SET/CLEAR_OPTIONS* *ioctls*. Thus, various -modes of operation can be set: - -`CDO_AUTO_CLOSE | CDO_USE_FFLAGS | CDO_LOCK` - This is the default setting. (With *CDO_CHECK_TYPE* it will be better, in - the future.) If the device is not yet opened by any other process, and if - the device is being opened for data (*O_NONBLOCK* is not set) and the - tray is found to be open, an attempt to close the tray is made. Then, - it is verified that a disc is in the drive and, if *CDO_CHECK_TYPE* is - set, that it contains tracks of type `data mode 1`. Only if all tests - are passed is the return value zero. The door is locked to prevent file - system corruption. If the drive is opened for audio (*O_NONBLOCK* is - set), no actions are taken and a value of 0 will be returned. - -`CDO_AUTO_CLOSE | CDO_AUTO_EJECT | CDO_LOCK` - This mimics the behavior of the current sbpcd-driver. The option flags are - ignored, the tray is closed on the first open, if necessary. Similarly, - the tray is opened on the last release, i. 
e., if a CD-ROM is unmounted, - it is automatically ejected, such that the user can replace it. - -We hope that these option can convince everybody (both driver -maintainers and user program developers) to adopt the new CD-ROM -driver scheme and option flag interpretation. - -Description of routines in `cdrom.c` -==================================== - -Only a few routines in `cdrom.c` are exported to the drivers. In this -new section we will discuss these, as well as the functions that `take -over' the CD-ROM interface to the kernel. The header file belonging -to `cdrom.c` is called `cdrom.h`. Formerly, some of the contents of this -file were placed in the file `ucdrom.h`, but this file has now been -merged back into `cdrom.h`. - -:: - - struct file_operations cdrom_fops - -The contents of this structure were described in cdrom_api_. -A pointer to this structure is assigned to the *fops* field -of the *struct gendisk*. - -:: - - int register_cdrom(struct cdrom_device_info *cdi) - -This function is used in about the same way one registers *cdrom_fops* -with the kernel, the device operations and information structures, -as described in cdrom_api_, should be registered with the -Uniform CD-ROM Driver:: - - register_cdrom(&_info); - - -This function returns zero upon success, and non-zero upon -failure. The structure *_info* should have a pointer to the -driver's *_dops*, as in:: - - struct cdrom_device_info _info = { - _dops; - ... - } - -Note that a driver must have one static structure, *_dops*, while -it may have as many structures *_info* as there are minor devices -active. *Register_cdrom()* builds a linked list from these. - - -:: - - void unregister_cdrom(struct cdrom_device_info *cdi) - -Unregistering device *cdi* with minor number *MINOR(cdi->dev)* removes -the minor device from the list. If it was the last registered minor for -the low-level driver, this disconnects the registered device-operation -routines from the CD-ROM interface. 
This function returns zero upon -success, and non-zero upon failure. - -:: - - int cdrom_open(struct inode * ip, struct file * fp) - -This function is not called directly by the low-level drivers; it is -listed in the standard *cdrom_fops*. If the VFS opens a file, this -function becomes active. A strategy is implemented in this routine, -taking care of all capabilities and options that are set in the -*cdrom_device_ops* connected to the device. Then, the program flow is -transferred to the device-dependent *open()* call. - -:: - - void cdrom_release(struct inode *ip, struct file *fp) - -This function implements the reverse-logic of *cdrom_open()*, and then -calls the device-dependent *release()* routine. When the use-count has -reached 0, the allocated buffers are flushed by calls to *sync_dev(dev)* -and *invalidate_buffers(dev)*. - - -.. _cdrom_ioctl: - -:: - - int cdrom_ioctl(struct inode *ip, struct file *fp, - unsigned int cmd, unsigned long arg) - -This function handles all the standard *ioctl* requests for CD-ROM -devices in a uniform way. The different calls fall into three -categories: *ioctl()'s* that can be directly implemented by device -operations, ones that are routed through the call *audio_ioctl()*, and -the remaining ones, that are presumably device-dependent. Generally, a -negative return value indicates an error. - -Directly implemented *ioctl()'s* --------------------------------- - -The following `old` CD-ROM *ioctl()*\ 's are implemented by directly -calling device-operations in *cdrom_device_ops*, if implemented and -not masked: - -`CDROMMULTISESSION` - Requests the last session on a CD-ROM. -`CDROMEJECT` - Open tray. -`CDROMCLOSETRAY` - Close tray. -`CDROMEJECT_SW` - If *arg != 0*, set behavior to auto-close (close - tray on first open) and auto-eject (eject on last release), otherwise - set behavior to non-moving on *open()* and *release()* calls. -`CDROM_GET_MCN` - Get the Media Catalog Number from a CD.
- -*Ioctl*s routed through *audio_ioctl()* ---------------------------------------- - -The following set of *ioctl()'s* are all implemented through a call to -the *cdrom_fops* function *audio_ioctl()*. Memory checks and -allocation are performed in *cdrom_ioctl()*, and also sanitization of -address format (*CDROM_LBA*/*CDROM_MSF*) is done. - -`CDROMSUBCHNL` - Get sub-channel data in argument *arg* of type - `struct cdrom_subchnl *`. -`CDROMREADTOCHDR` - Read Table of Contents header, in *arg* of type - `struct cdrom_tochdr *`. -`CDROMREADTOCENTRY` - Read a Table of Contents entry in *arg* and specified by *arg* - of type `struct cdrom_tocentry *`. -`CDROMPLAYMSF` - Play audio fragment specified in Minute, Second, Frame format, - delimited by *arg* of type `struct cdrom_msf *`. -`CDROMPLAYTRKIND` - Play audio fragment in track-index format delimited by *arg* - of type `struct cdrom_ti *`. -`CDROMVOLCTRL` - Set volume specified by *arg* of type `struct cdrom_volctrl *`. -`CDROMVOLREAD` - Read volume into *arg* of type `struct cdrom_volctrl *`. -`CDROMSTART` - Spin up disc. -`CDROMSTOP` - Stop playback of audio fragment. -`CDROMPAUSE` - Pause playback of audio fragment. -`CDROMRESUME` - Resume playing. - -New *ioctl()'s* in `cdrom.c` ----------------------------- - -The following *ioctl()'s* have been introduced to allow user programs to -control the behavior of individual CD-ROM devices. New *ioctl* -commands can be identified by the underscores in their names. - -`CDROM_SET_OPTIONS` - Set options specified by *arg*. Returns the option flag register - after modification. Use *arg = 0* for reading the current flags. -`CDROM_CLEAR_OPTIONS` - Clear options specified by *arg*. Returns the option flag register - after modification. -`CDROM_SELECT_SPEED` - Select head-rate speed of disc specified by *arg* in units - of standard cdrom speed (176 kB/sec raw data or - 150 kB/sec file system data). The value 0 means `auto-select`, - i.
e., play audio discs at real time and data discs at maximum speed. - The value *arg* is checked against the maximum head rate of the - drive found in the *cdrom_dops*. -`CDROM_SELECT_DISC` - Select disc numbered *arg* from a juke-box. - - First disc is numbered 0. The number *arg* is checked against the - maximum number of discs in the juke-box found in the *cdrom_dops*. -`CDROM_MEDIA_CHANGED` - Returns 1 if a disc has been changed since the last call. - Note that calls to *cdrom_media_changed* by the VFS are treated - by an independent queue, so both mechanisms will detect a - media change once. For juke-boxes, an extra argument *arg* - specifies the slot for which the information is given. The special - value *CDSL_CURRENT* requests that information about the currently - selected slot be returned. -`CDROM_DRIVE_STATUS` - Returns the status of the drive by a call to - *drive_status()*. Return values are defined in cdrom_drive_status_. - Note that this call doesn't return information on the - current playing activity of the drive; this can be polled through - an *ioctl* call to *CDROMSUBCHNL*. For juke-boxes, an extra argument - *arg* specifies the slot for which (possibly limited) information is - given. The special value *CDSL_CURRENT* requests that information - about the currently selected slot be returned. -`CDROM_DISC_STATUS` - Returns the type of the disc currently in the drive. - It should be viewed as a complement to *CDROM_DRIVE_STATUS*. - This *ioctl* can provide *some* information about the current - disc that is inserted in the drive. This functionality used to be - implemented in the low level drivers, but is now carried out - entirely in the Uniform CD-ROM Driver. - - The history of development of the CD's use as a carrier medium for - various digital information has led to many different disc types. - This *ioctl* is useful only in the case that CDs have *only - one* type of data on them.
While this is often the case, it is - also very common for CDs to have some tracks with data, and some - tracks with audio. Because this is an existing interface, rather - than fixing this interface by changing the assumptions it was made - under, thereby breaking all user applications that use this - function, the Uniform CD-ROM Driver implements this *ioctl* as - follows: If the CD in question has audio tracks on it, and it has - absolutely no CD-I, XA, or data tracks on it, it will be reported - as *CDS_AUDIO*. If it has both audio and data tracks, it will - return *CDS_MIXED*. If there are no audio tracks on the disc, and - if the CD in question has any CD-I tracks on it, it will be - reported as *CDS_XA_2_2*. Failing that, if the CD in question - has any XA tracks on it, it will be reported as *CDS_XA_2_1*. - Finally, if the CD in question has any data tracks on it, - it will be reported as a data CD (*CDS_DATA_1*). - - This *ioctl* can return:: - - CDS_NO_INFO /* no information available */ - CDS_NO_DISC /* no disc is inserted, or tray is opened */ - CDS_AUDIO /* Audio disc (2352 audio bytes/frame) */ - CDS_DATA_1 /* data disc, mode 1 (2048 user bytes/frame) */ - CDS_XA_2_1 /* mixed data (XA), mode 2, form 1 (2048 user bytes) */ - CDS_XA_2_2 /* mixed data (XA), mode 2, form 2 (2324 user bytes) */ - CDS_MIXED /* mixed audio/data disc */ - - For some information concerning frame layout of the various disc - types, see a recent version of `cdrom.h`. - -`CDROM_CHANGER_NSLOTS` - Returns the number of slots in a juke-box. -`CDROMRESET` - Reset the drive. -`CDROM_GET_CAPABILITY` - Returns the *capability* flags for the drive. Refer to section - cdrom_capabilities_ for more information on these flags. -`CDROM_LOCKDOOR` - Locks the door of the drive. `arg == 0` unlocks the door, - any other value locks it. -`CDROM_DEBUG` - Turns on debugging info. Only root is allowed to do this. - Same semantics as CDROM_LOCKDOOR.
- - -Device dependent *ioctl()'s* ----------------------------- - -Finally, all other *ioctl()'s* are passed to the function *dev_ioctl()*, -if implemented. No memory allocation or verification is carried out. - -How to update your driver -========================= - -- Make a backup of your current driver. -- Get hold of the files `cdrom.c` and `cdrom.h`, they should be in - the directory tree that came with this documentation. -- Make sure you include `cdrom.h`. -- Change the 3rd argument of *register_blkdev* from `&_fops` - to `&cdrom_fops`. -- Just after that line, add the following to register with the Uniform - CD-ROM Driver:: - - register_cdrom(&_info);* - - Similarly, add a call to *unregister_cdrom()* at the appropriate place. -- Copy an example of the device-operations *struct* to your - source, e. g., from `cm206.c` *cm206_dops*, and change all - entries to names corresponding to your driver, or names you just - happen to like. If your driver doesn't support a certain function, - make the entry *NULL*. At the entry *capability* you should list all - capabilities your driver currently supports. If your driver - has a capability that is not listed, please send me a message. -- Copy the *cdrom_device_info* declaration from the same example - driver, and modify the entries according to your needs. If your - driver dynamically determines the capabilities of the hardware, this - structure should also be declared dynamically. -- Implement all functions in your `_dops` structure, - according to prototypes listed in `cdrom.h`, and specifications given - in cdrom_api_. Most likely you have already implemented - the code in a large part, and you will almost certainly need to adapt the - prototype and return values. -- Rename your `_ioctl()` function to *audio_ioctl* and - change the prototype a little. Remove entries listed in the first - part in cdrom_ioctl_, if your code was OK, these are - just calls to the routines you adapted in the previous step. 
-- You may remove all remaining memory checking code in the - *audio_ioctl()* function that deals with audio commands (these are - listed in the second part of cdrom_ioctl_). There is no - need for memory allocation either, so most *case*s in the *switch* - statement look similar to:: - - case CDROMREADTOCENTRY: - get_toc_entry((struct cdrom_tocentry *) arg); - -- All remaining *ioctl* cases must be moved to a separate - function, *_ioctl*, the device-dependent *ioctl()'s*. Note that - memory checking and allocation must be kept in this code! -- Change the prototypes of *_open()* and - *_release()*, and remove any strategic code (i. e., tray - movement, door locking, etc.). -- Try to recompile the drivers. We advise you to use modules, both - for `cdrom.o` and your driver, as debugging is much easier this - way. - -Thanks -====== - -Thanks to all the people involved. First, Erik Andersen, who has -taken over the torch in maintaining `cdrom.c` and integrating much -CD-ROM-related code in the 2.1-kernel. Thanks to Scott Snyder and -Gerd Knorr, who were the first to implement this interface for SCSI -and IDE-CD drivers and added many ideas for extension of the data -structures relative to kernel 2.0. Further thanks to Heiko Eißfeldt, -Thomas Quinot, Jon Tombs, Ken Pizzini, Eberhard Mönkeberg and Andrew Kroll, -the Linux CD-ROM device driver developers who were kind -enough to give suggestions and criticisms during the writing. Finally -of course, I want to thank Linus Torvalds for making this possible in -the first place. diff --git a/Documentation/cdrom/ide-cd b/Documentation/cdrom/ide-cd deleted file mode 100644 index a5f2a7f1ff46..000000000000 --- a/Documentation/cdrom/ide-cd +++ /dev/null @@ -1,534 +0,0 @@ -IDE-CD driver documentation -Originally by scott snyder (19 May 1996) -Carrying on the torch is: Erik Andersen -New maintainers (19 Oct 1998): Jens Axboe - -1.
Introduction ---------------- - -The ide-cd driver should work with all ATAPI ver 1.2 to ATAPI 2.6 compliant -CDROM drives which attach to an IDE interface. Note that some CDROM vendors -(including Mitsumi, Sony, Creative, Aztech, and Goldstar) have made -both ATAPI-compliant drives and drives which use a proprietary -interface. If your drive uses one of those proprietary interfaces, -this driver will not work with it (but one of the other CDROM drivers -probably will). This driver will not work with `ATAPI' drives which -attach to the parallel port. In addition, there is at least one drive -(CyCDROM CR520ie) which attaches to the IDE port but is not ATAPI; -this driver will not work with drives like that either (but see the -aztcd driver). - -This driver provides the following features: - - - Reading from data tracks, and mounting ISO 9660 filesystems. - - - Playing audio tracks. Most of the CDROM player programs floating - around should work; I usually use Workman. - - - Multisession support. - - - On drives which support it, reading digital audio data directly - from audio tracks. The program cdda2wav can be used for this. - Note, however, that only some drives actually support this. - - - There is now support for CDROM changers which comply with the - ATAPI 2.6 draft standard (such as the NEC CDR-251). This additional - functionality includes a function call to query which slot is the - currently selected slot, a function call to query which slots contain - CDs, etc. A sample program which demonstrates this functionality is - appended to the end of this file. The Sanyo 3-disc changer - (which does not conform to the standard) is also now supported. - Please note the driver refers to the first CD as slot # 0. - - -2. Installation ---------------- - -0. The ide-cd relies on the ide disk driver. See - Documentation/ide/ide.txt for up-to-date information on the ide - driver. - -1. 
Make sure that the ide and ide-cd drivers are compiled into the - kernel you're using. When configuring the kernel, in the section - entitled "Floppy, IDE, and other block devices", say either `Y' - (which will compile the support directly into the kernel) or `M' - (to compile support as a module which can be loaded and unloaded) - to the options: - - ATA/ATAPI/MFM/RLL support - Include IDE/ATAPI CDROM support - - Depending on what type of IDE interface you have, you may need to - specify additional configuration options. See - Documentation/ide/ide.txt. - -2. You should also ensure that the iso9660 filesystem is either - compiled into the kernel or available as a loadable module. You - can see if a filesystem is known to the kernel by catting - /proc/filesystems. - -3. The CDROM drive should be connected to the host on an IDE - interface. Each interface on a system is defined by an I/O port - address and an IRQ number, the standard assignments being - 0x1f0 and 14 for the primary interface and 0x170 and 15 for the - secondary interface. Each interface can control up to two devices, - where each device can be a hard drive, a CDROM drive, a floppy drive, - or a tape drive. The two devices on an interface are called `master' - and `slave'; this is usually selectable via a jumper on the drive. - - Linux names these devices as follows. The master and slave devices - on the primary IDE interface are called `hda' and `hdb', - respectively. The drives on the secondary interface are called - `hdc' and `hdd'. (Interfaces at other locations get other letters - in the third position; see Documentation/ide/ide.txt.) - - If you want your CDROM drive to be found automatically by the - driver, you should make sure your IDE interface uses either the - primary or secondary addresses mentioned above. In addition, if - the CDROM drive is the only device on the IDE interface, it should - be jumpered as `master'. 
(If for some reason you cannot configure - your system in this manner, you can probably still use the driver. - You may have to pass extra configuration information to the kernel - when you boot, however. See Documentation/ide/ide.txt for more - information.) - -4. Boot the system. If the drive is recognized, you should see a - message which looks like - - hdb: NEC CD-ROM DRIVE:260, ATAPI CDROM drive - - If you do not see this, see section 5 below. - -5. You may want to create a symbolic link /dev/cdrom pointing to the - actual device. You can do this with the command - - ln -s /dev/hdX /dev/cdrom - - where X should be replaced by the letter indicating where your - drive is installed. - -6. You should be able to see any error messages from the driver with - the `dmesg' command. - - -3. Basic usage --------------- - -An ISO 9660 CDROM can be mounted by putting the disc in the drive and -typing (as root) - - mount -t iso9660 /dev/cdrom /mnt/cdrom - -where it is assumed that /dev/cdrom is a link pointing to the actual -device (as described in step 5 of the last section) and /mnt/cdrom is -an empty directory. You should now be able to see the contents of the -CDROM under the /mnt/cdrom directory. If you want to eject the CDROM, -you must first dismount it with a command like - - umount /mnt/cdrom - -Note that audio CDs cannot be mounted. - -Some distributions set up /etc/fstab to always try to mount a CDROM -filesystem on bootup. It is not required to mount the CDROM in this -manner, though, and it may be a nuisance if you change CDROMs often. -You should feel free to remove the cdrom line from /etc/fstab and -mount CDROMs manually if that suits you better. - -Multisession and photocd discs should work with no special handling. -The hpcdtoppm package (ftp.gwdg.de:/pub/linux/hpcdtoppm/) may be -useful for reading photocds. - -To play an audio CD, you should first unmount and remove any data -CDROM. 
Any of the CDROM player programs should then work (workman, -workbone, cdplayer, etc.). - -On a few drives, you can read digital audio directly using a program -such as cdda2wav. The only types of drive which I've heard support -this are Sony and Toshiba drives. You will get errors if you try to -use this function on a drive which does not support it. - -For supported changers, you can use the `cdchange' program (appended to -the end of this file) to switch between changer slots. Note that the -drive should be unmounted before attempting this. The program takes -two arguments: the CDROM device, and the slot number to which you wish -to change. If the slot number is -1, the drive is unloaded. - - -4. Common problems ------------------- - -This section discusses some common problems encountered when trying to -use the driver, and some possible solutions. Note that if you are -experiencing problems, you should probably also review -Documentation/ide/ide.txt for current information about the underlying -IDE support code. Some of these items apply only to earlier versions -of the driver, but are mentioned here for completeness. - -In most cases, you should probably check with `dmesg' for any errors -from the driver. - -a. Drive is not detected during booting. - - - Review the configuration instructions above and in - Documentation/ide/ide.txt, and check how your hardware is - configured. - - - If your drive is the only device on an IDE interface, it should - be jumpered as master, if at all possible. - - - If your IDE interface is not at the standard addresses of 0x170 - or 0x1f0, you'll need to explicitly inform the driver using a - lilo option. See Documentation/ide/ide.txt. (This feature was - added around kernel version 1.3.30.) - - - If the autoprobing is not finding your drive, you can tell the - driver to assume that one exists by using a lilo option of the - form `hdX=cdrom', where X is the drive letter corresponding to - where your drive is installed. 
Note that if you do this and you - see a boot message like - - hdX: ATAPI cdrom (?) - - this does _not_ mean that the driver has successfully detected - the drive; rather, it means that the driver has not detected a - drive, but is assuming there's one there anyway because you told - it so. If you actually try to do I/O to a drive defined at a - nonexistent or nonresponding I/O address, you'll probably get - errors with a status value of 0xff. - - - Some IDE adapters require a nonstandard initialization sequence - before they'll function properly. (If this is the case, there - will often be a separate MS-DOS driver just for the controller.) - IDE interfaces on sound cards often fall into this category. - - Support for some interfaces needing extra initialization is - provided in later 1.3.x kernels. You may need to turn on - additional kernel configuration options to get them to work; - see Documentation/ide/ide.txt. - - Even if support is not available for your interface, you may be - able to get it to work with the following procedure. First boot - MS-DOS and load the appropriate drivers. Then warm-boot linux - (i.e., without powering off). If this works, it can be automated - by running loadlin from the MS-DOS autoexec. - - -b. Timeout/IRQ errors. - - - If you always get timeout errors, interrupts from the drive are - probably not making it to the host. - - - IRQ problems may also be indicated by the message - `IRQ probe failed ()' while booting. If is zero, that - means that the system did not see an interrupt from the drive when - it was expecting one (on any feasible IRQ). If is negative, - that means the system saw interrupts on multiple IRQ lines, when - it was expecting to receive just one from the CDROM drive. - - - Double-check your hardware configuration to make sure that the IRQ - number of your IDE interface matches what the driver expects. - (The usual assignments are 14 for the primary (0x1f0) interface - and 15 for the secondary (0x170) interface.) 
Also be sure that - you don't have some other hardware which might be conflicting with - the IRQ you're using. Also check the BIOS setup for your system; - some have the ability to disable individual IRQ levels, and I've - had one report of a system which was shipped with IRQ 15 disabled - by default. - - - Note that many MS-DOS CDROM drivers will still function even if - there are hardware problems with the interrupt setup; they - apparently don't use interrupts. - - - If you own a Pioneer DR-A24X, you _will_ get nasty error messages - on boot such as "irq timeout: status=0x50 { DriveReady SeekComplete }" - The Pioneer DR-A24X CDROM drives are fairly popular these days. - Unfortunately, these drives seem to become very confused when we perform - the standard Linux ATA disk drive probe. If you own one of these drives, - you can bypass the ATA probing which confuses these CDROM drives, by - adding `append="hdX=noprobe hdX=cdrom"' to your lilo.conf file and running - lilo (again where X is the drive letter corresponding to where your drive - is installed.) - -c. System hangups. - - - If the system locks up when you try to access the CDROM, the most - likely cause is that you have a buggy IDE adapter which doesn't - properly handle simultaneous transactions on multiple interfaces. - The most notorious of these is the CMD640B chip. This problem can - be worked around by specifying the `serialize' option when - booting. Recent kernels should be able to detect the need for - this automatically in most cases, but the detection is not - foolproof. See Documentation/ide/ide.txt for more information - about the `serialize' option and the CMD640B. - - - Note that many MS-DOS CDROM drivers will work with such buggy - hardware, apparently because they never attempt to overlap CDROM - operations with other disk activity. - - -d. Can't mount a CDROM. 
- - - If you get errors from mount, it may help to check `dmesg' to see - if there are any more specific errors from the driver or from the - filesystem. - - - Make sure there's a CDROM loaded in the drive, and that's it's an - ISO 9660 disc. You can't mount an audio CD. - - - With the CDROM in the drive and unmounted, try something like - - cat /dev/cdrom | od | more - - If you see a dump, then the drive and driver are probably working - OK, and the problem is at the filesystem level (i.e., the CDROM is - not ISO 9660 or has errors in the filesystem structure). - - - If you see `not a block device' errors, check that the definitions - of the device special files are correct. They should be as - follows: - - brw-rw---- 1 root disk 3, 0 Nov 11 18:48 /dev/hda - brw-rw---- 1 root disk 3, 64 Nov 11 18:48 /dev/hdb - brw-rw---- 1 root disk 22, 0 Nov 11 18:48 /dev/hdc - brw-rw---- 1 root disk 22, 64 Nov 11 18:48 /dev/hdd - - Some early Slackware releases had these defined incorrectly. If - these are wrong, you can remake them by running the script - scripts/MAKEDEV.ide. (You may have to make it executable - with chmod first.) - - If you have a /dev/cdrom symbolic link, check that it is pointing - to the correct device file. - - If you hear people talking of the devices `hd1a' and `hd1b', these - were old names for what are now called hdc and hdd. Those names - should be considered obsolete. - - - If mount is complaining that the iso9660 filesystem is not - available, but you know it is (check /proc/filesystems), you - probably need a newer version of mount. Early versions would not - always give meaningful error messages. - - -e. Directory listings are unpredictably truncated, and `dmesg' shows - `buffer botch' error messages from the driver. - - - There was a bug in the version of the driver in 1.2.x kernels - which could cause this. It was fixed in 1.3.0. If you can't - upgrade, you can probably work around the problem by specifying a - blocksize of 2048 when mounting. 
(Note that you won't be able to - directly execute binaries off the CDROM in that case.) - - If you see this in kernels later than 1.3.0, please report it as a - bug. - - -f. Data corruption. - - - Random data corruption was occasionally observed with the Hitachi - CDR-7730 CDROM. If you experience data corruption, using "hdx=slow" - as a command line parameter may work around the problem, at the - expense of low system performance. - - -5. cdchange.c -------------- - -/* - * cdchange.c [-v] [] - * - * This loads a CDROM from a specified slot in a changer, and displays - * information about the changer status. The drive should be unmounted before - * using this program. - * - * Changer information is displayed if either the -v flag is specified - * or no slot was specified. - * - * Based on code originally from Gerhard Zuber . - * Changer status information, and rewrite for the new Uniform CDROM driver - * interface by Erik Andersen . - */ - -#include -#include -#include -#include -#include -#include -#include -#include - - -int -main (int argc, char **argv) -{ - char *program; - char *device; - int fd; /* file descriptor for CD-ROM device */ - int status; /* return status for system calls */ - int verbose = 0; - int slot=-1, x_slot; - int total_slots_available; - - program = argv[0]; - - ++argv; - --argc; - - if (argc < 1 || argc > 3) { - fprintf (stderr, "usage: %s [-v] []\n", - program); - fprintf (stderr, " Slots are numbered 1 -- n.\n"); - exit (1); - } - - if (strcmp (argv[0], "-v") == 0) { - verbose = 1; - ++argv; - --argc; - } - - device = argv[0]; - - if (argc == 2) - slot = atoi (argv[1]) - 1; - - /* open device */ - fd = open(device, O_RDONLY | O_NONBLOCK); - if (fd < 0) { - fprintf (stderr, "%s: open failed for `%s': %s\n", - program, device, strerror (errno)); - exit (1); - } - - /* Check CD player status */ - total_slots_available = ioctl (fd, CDROM_CHANGER_NSLOTS); - if (total_slots_available <= 1 ) { - fprintf (stderr, "%s: Device `%s' is not an 
ATAPI " - "compliant CD changer.\n", program, device); - exit (1); - } - - if (slot >= 0) { - if (slot >= total_slots_available) { - fprintf (stderr, "Bad slot number. " - "Should be 1 -- %d.\n", - total_slots_available); - exit (1); - } - - /* load */ - slot=ioctl (fd, CDROM_SELECT_DISC, slot); - if (slot<0) { - fflush(stdout); - perror ("CDROM_SELECT_DISC "); - exit(1); - } - } - - if (slot < 0 || verbose) { - - status=ioctl (fd, CDROM_SELECT_DISC, CDSL_CURRENT); - if (status<0) { - fflush(stdout); - perror (" CDROM_SELECT_DISC"); - exit(1); - } - slot=status; - - printf ("Current slot: %d\n", slot+1); - printf ("Total slots available: %d\n", - total_slots_available); - - printf ("Drive status: "); - status = ioctl (fd, CDROM_DRIVE_STATUS, CDSL_CURRENT); - if (status<0) { - perror(" CDROM_DRIVE_STATUS"); - } else switch(status) { - case CDS_DISC_OK: - printf ("Ready.\n"); - break; - case CDS_TRAY_OPEN: - printf ("Tray Open.\n"); - break; - case CDS_DRIVE_NOT_READY: - printf ("Drive Not Ready.\n"); - break; - default: - printf ("This Should not happen!\n"); - break; - } - - for (x_slot=0; x_slot (19 May 1996) +:Carrying on the torch is: Erik Andersen +:New maintainers (19 Oct 1998): Jens Axboe + +1. Introduction +--------------- + +The ide-cd driver should work with all ATAPI ver 1.2 to ATAPI 2.6 compliant +CDROM drives which attach to an IDE interface. Note that some CDROM vendors +(including Mitsumi, Sony, Creative, Aztech, and Goldstar) have made +both ATAPI-compliant drives and drives which use a proprietary +interface. If your drive uses one of those proprietary interfaces, +this driver will not work with it (but one of the other CDROM drivers +probably will). This driver will not work with `ATAPI` drives which +attach to the parallel port. In addition, there is at least one drive +(CyCDROM CR520ie) which attaches to the IDE port but is not ATAPI; +this driver will not work with drives like that either (but see the +aztcd driver). 
+ +This driver provides the following features: + + - Reading from data tracks, and mounting ISO 9660 filesystems. + + - Playing audio tracks. Most of the CDROM player programs floating + around should work; I usually use Workman. + + - Multisession support. + + - On drives which support it, reading digital audio data directly + from audio tracks. The program cdda2wav can be used for this. + Note, however, that only some drives actually support this. + + - There is now support for CDROM changers which comply with the + ATAPI 2.6 draft standard (such as the NEC CDR-251). This additional + functionality includes a function call to query which slot is the + currently selected slot, a function call to query which slots contain + CDs, etc. A sample program which demonstrates this functionality is + appended to the end of this file. The Sanyo 3-disc changer + (which does not conform to the standard) is also now supported. + Please note the driver refers to the first CD as slot # 0. + + +2. Installation +--------------- + +0. The ide-cd relies on the ide disk driver. See + Documentation/ide/ide.txt for up-to-date information on the ide + driver. + +1. Make sure that the ide and ide-cd drivers are compiled into the + kernel you're using. When configuring the kernel, in the section + entitled "Floppy, IDE, and other block devices", say either `Y` + (which will compile the support directly into the kernel) or `M` + (to compile support as a module which can be loaded and unloaded) + to the options:: + + ATA/ATAPI/MFM/RLL support + Include IDE/ATAPI CDROM support + + Depending on what type of IDE interface you have, you may need to + specify additional configuration options. See + Documentation/ide/ide.txt. + +2. You should also ensure that the iso9660 filesystem is either + compiled into the kernel or available as a loadable module. You + can see if a filesystem is known to the kernel by catting + /proc/filesystems. + +3. 
The CDROM drive should be connected to the host on an IDE + interface. Each interface on a system is defined by an I/O port + address and an IRQ number, the standard assignments being + 0x1f0 and 14 for the primary interface and 0x170 and 15 for the + secondary interface. Each interface can control up to two devices, + where each device can be a hard drive, a CDROM drive, a floppy drive, + or a tape drive. The two devices on an interface are called `master` + and `slave`; this is usually selectable via a jumper on the drive. + + Linux names these devices as follows. The master and slave devices + on the primary IDE interface are called `hda` and `hdb`, + respectively. The drives on the secondary interface are called + `hdc` and `hdd`. (Interfaces at other locations get other letters + in the third position; see Documentation/ide/ide.txt.) + + If you want your CDROM drive to be found automatically by the + driver, you should make sure your IDE interface uses either the + primary or secondary addresses mentioned above. In addition, if + the CDROM drive is the only device on the IDE interface, it should + be jumpered as `master`. (If for some reason you cannot configure + your system in this manner, you can probably still use the driver. + You may have to pass extra configuration information to the kernel + when you boot, however. See Documentation/ide/ide.txt for more + information.) + +4. Boot the system. If the drive is recognized, you should see a + message which looks like:: + + hdb: NEC CD-ROM DRIVE:260, ATAPI CDROM drive + + If you do not see this, see section 5 below. + +5. You may want to create a symbolic link /dev/cdrom pointing to the + actual device. You can do this with the command:: + + ln -s /dev/hdX /dev/cdrom + + where X should be replaced by the letter indicating where your + drive is installed. + +6. You should be able to see any error messages from the driver with + the `dmesg` command. + + +3. 
Basic usage +-------------- + +An ISO 9660 CDROM can be mounted by putting the disc in the drive and +typing (as root):: + + mount -t iso9660 /dev/cdrom /mnt/cdrom + +where it is assumed that /dev/cdrom is a link pointing to the actual +device (as described in step 5 of the last section) and /mnt/cdrom is +an empty directory. You should now be able to see the contents of the +CDROM under the /mnt/cdrom directory. If you want to eject the CDROM, +you must first dismount it with a command like:: + + umount /mnt/cdrom + +Note that audio CDs cannot be mounted. + +Some distributions set up /etc/fstab to always try to mount a CDROM +filesystem on bootup. It is not required to mount the CDROM in this +manner, though, and it may be a nuisance if you change CDROMs often. +You should feel free to remove the cdrom line from /etc/fstab and +mount CDROMs manually if that suits you better. + +Multisession and photocd discs should work with no special handling. +The hpcdtoppm package (ftp.gwdg.de:/pub/linux/hpcdtoppm/) may be +useful for reading photocds. + +To play an audio CD, you should first unmount and remove any data +CDROM. Any of the CDROM player programs should then work (workman, +workbone, cdplayer, etc.). + +On a few drives, you can read digital audio directly using a program +such as cdda2wav. The only types of drive which I've heard support +this are Sony and Toshiba drives. You will get errors if you try to +use this function on a drive which does not support it. + +For supported changers, you can use the `cdchange` program (appended to +the end of this file) to switch between changer slots. Note that the +drive should be unmounted before attempting this. The program takes +two arguments: the CDROM device, and the slot number to which you wish +to change. If the slot number is -1, the drive is unloaded. + + +4. Common problems +------------------ + +This section discusses some common problems encountered when trying to +use the driver, and some possible solutions. 
Note that if you are +experiencing problems, you should probably also review +Documentation/ide/ide.txt for current information about the underlying +IDE support code. Some of these items apply only to earlier versions +of the driver, but are mentioned here for completeness. + +In most cases, you should probably check with `dmesg` for any errors +from the driver. + +a. Drive is not detected during booting. + + - Review the configuration instructions above and in + Documentation/ide/ide.txt, and check how your hardware is + configured. + + - If your drive is the only device on an IDE interface, it should + be jumpered as master, if at all possible. + + - If your IDE interface is not at the standard addresses of 0x170 + or 0x1f0, you'll need to explicitly inform the driver using a + lilo option. See Documentation/ide/ide.txt. (This feature was + added around kernel version 1.3.30.) + + - If the autoprobing is not finding your drive, you can tell the + driver to assume that one exists by using a lilo option of the + form `hdX=cdrom`, where X is the drive letter corresponding to + where your drive is installed. Note that if you do this and you + see a boot message like:: + + hdX: ATAPI cdrom (?) + + this does _not_ mean that the driver has successfully detected + the drive; rather, it means that the driver has not detected a + drive, but is assuming there's one there anyway because you told + it so. If you actually try to do I/O to a drive defined at a + nonexistent or nonresponding I/O address, you'll probably get + errors with a status value of 0xff. + + - Some IDE adapters require a nonstandard initialization sequence + before they'll function properly. (If this is the case, there + will often be a separate MS-DOS driver just for the controller.) + IDE interfaces on sound cards often fall into this category. + + Support for some interfaces needing extra initialization is + provided in later 1.3.x kernels. 
You may need to turn on + additional kernel configuration options to get them to work; + see Documentation/ide/ide.txt. + + Even if support is not available for your interface, you may be + able to get it to work with the following procedure. First boot + MS-DOS and load the appropriate drivers. Then warm-boot linux + (i.e., without powering off). If this works, it can be automated + by running loadlin from the MS-DOS autoexec. + + +b. Timeout/IRQ errors. + + - If you always get timeout errors, interrupts from the drive are + probably not making it to the host. + + - IRQ problems may also be indicated by the message + `IRQ probe failed ()` while booting. If is zero, that + means that the system did not see an interrupt from the drive when + it was expecting one (on any feasible IRQ). If is negative, + that means the system saw interrupts on multiple IRQ lines, when + it was expecting to receive just one from the CDROM drive. + + - Double-check your hardware configuration to make sure that the IRQ + number of your IDE interface matches what the driver expects. + (The usual assignments are 14 for the primary (0x1f0) interface + and 15 for the secondary (0x170) interface.) Also be sure that + you don't have some other hardware which might be conflicting with + the IRQ you're using. Also check the BIOS setup for your system; + some have the ability to disable individual IRQ levels, and I've + had one report of a system which was shipped with IRQ 15 disabled + by default. + + - Note that many MS-DOS CDROM drivers will still function even if + there are hardware problems with the interrupt setup; they + apparently don't use interrupts. + + - If you own a Pioneer DR-A24X, you _will_ get nasty error messages + on boot such as "irq timeout: status=0x50 { DriveReady SeekComplete }" + The Pioneer DR-A24X CDROM drives are fairly popular these days. + Unfortunately, these drives seem to become very confused when we perform + the standard Linux ATA disk drive probe. 
If you own one of these drives, + you can bypass the ATA probing which confuses these CDROM drives, by + adding `append="hdX=noprobe hdX=cdrom"` to your lilo.conf file and running + lilo (again where X is the drive letter corresponding to where your drive + is installed.) + +c. System hangups. + + - If the system locks up when you try to access the CDROM, the most + likely cause is that you have a buggy IDE adapter which doesn't + properly handle simultaneous transactions on multiple interfaces. + The most notorious of these is the CMD640B chip. This problem can + be worked around by specifying the `serialize` option when + booting. Recent kernels should be able to detect the need for + this automatically in most cases, but the detection is not + foolproof. See Documentation/ide/ide.txt for more information + about the `serialize` option and the CMD640B. + + - Note that many MS-DOS CDROM drivers will work with such buggy + hardware, apparently because they never attempt to overlap CDROM + operations with other disk activity. + + +d. Can't mount a CDROM. + + - If you get errors from mount, it may help to check `dmesg` to see + if there are any more specific errors from the driver or from the + filesystem. + + - Make sure there's a CDROM loaded in the drive, and that's it's an + ISO 9660 disc. You can't mount an audio CD. + + - With the CDROM in the drive and unmounted, try something like:: + + cat /dev/cdrom | od | more + + If you see a dump, then the drive and driver are probably working + OK, and the problem is at the filesystem level (i.e., the CDROM is + not ISO 9660 or has errors in the filesystem structure). + + - If you see `not a block device` errors, check that the definitions + of the device special files are correct. 
They should be as + follows:: + + brw-rw---- 1 root disk 3, 0 Nov 11 18:48 /dev/hda + brw-rw---- 1 root disk 3, 64 Nov 11 18:48 /dev/hdb + brw-rw---- 1 root disk 22, 0 Nov 11 18:48 /dev/hdc + brw-rw---- 1 root disk 22, 64 Nov 11 18:48 /dev/hdd + + Some early Slackware releases had these defined incorrectly. If + these are wrong, you can remake them by running the script + scripts/MAKEDEV.ide. (You may have to make it executable + with chmod first.) + + If you have a /dev/cdrom symbolic link, check that it is pointing + to the correct device file. + + If you hear people talking of the devices `hd1a` and `hd1b`, these + were old names for what are now called hdc and hdd. Those names + should be considered obsolete. + + - If mount is complaining that the iso9660 filesystem is not + available, but you know it is (check /proc/filesystems), you + probably need a newer version of mount. Early versions would not + always give meaningful error messages. + + +e. Directory listings are unpredictably truncated, and `dmesg` shows + `buffer botch` error messages from the driver. + + - There was a bug in the version of the driver in 1.2.x kernels + which could cause this. It was fixed in 1.3.0. If you can't + upgrade, you can probably work around the problem by specifying a + blocksize of 2048 when mounting. (Note that you won't be able to + directly execute binaries off the CDROM in that case.) + + If you see this in kernels later than 1.3.0, please report it as a + bug. + + +f. Data corruption. + + - Random data corruption was occasionally observed with the Hitachi + CDR-7730 CDROM. If you experience data corruption, using "hdx=slow" + as a command line parameter may work around the problem, at the + expense of low system performance. + + +5. cdchange.c +------------- + +:: + + /* + * cdchange.c [-v] [] + * + * This loads a CDROM from a specified slot in a changer, and displays + * information about the changer status. 
The drive should be unmounted before + * using this program. + * + * Changer information is displayed if either the -v flag is specified + * or no slot was specified. + * + * Based on code originally from Gerhard Zuber . + * Changer status information, and rewrite for the new Uniform CDROM driver + * interface by Erik Andersen . + */ + + #include + #include + #include + #include + #include + #include + #include + #include + + + int + main (int argc, char **argv) + { + char *program; + char *device; + int fd; /* file descriptor for CD-ROM device */ + int status; /* return status for system calls */ + int verbose = 0; + int slot=-1, x_slot; + int total_slots_available; + + program = argv[0]; + + ++argv; + --argc; + + if (argc < 1 || argc > 3) { + fprintf (stderr, "usage: %s [-v] []\n", + program); + fprintf (stderr, " Slots are numbered 1 -- n.\n"); + exit (1); + } + + if (strcmp (argv[0], "-v") == 0) { + verbose = 1; + ++argv; + --argc; + } + + device = argv[0]; + + if (argc == 2) + slot = atoi (argv[1]) - 1; + + /* open device */ + fd = open(device, O_RDONLY | O_NONBLOCK); + if (fd < 0) { + fprintf (stderr, "%s: open failed for `%s`: %s\n", + program, device, strerror (errno)); + exit (1); + } + + /* Check CD player status */ + total_slots_available = ioctl (fd, CDROM_CHANGER_NSLOTS); + if (total_slots_available <= 1 ) { + fprintf (stderr, "%s: Device `%s` is not an ATAPI " + "compliant CD changer.\n", program, device); + exit (1); + } + + if (slot >= 0) { + if (slot >= total_slots_available) { + fprintf (stderr, "Bad slot number. 
" + "Should be 1 -- %d.\n", + total_slots_available); + exit (1); + } + + /* load */ + slot=ioctl (fd, CDROM_SELECT_DISC, slot); + if (slot<0) { + fflush(stdout); + perror ("CDROM_SELECT_DISC "); + exit(1); + } + } + + if (slot < 0 || verbose) { + + status=ioctl (fd, CDROM_SELECT_DISC, CDSL_CURRENT); + if (status<0) { + fflush(stdout); + perror (" CDROM_SELECT_DISC"); + exit(1); + } + slot=status; + + printf ("Current slot: %d\n", slot+1); + printf ("Total slots available: %d\n", + total_slots_available); + + printf ("Drive status: "); + status = ioctl (fd, CDROM_DRIVE_STATUS, CDSL_CURRENT); + if (status<0) { + perror(" CDROM_DRIVE_STATUS"); + } else switch(status) { + case CDS_DISC_OK: + printf ("Ready.\n"); + break; + case CDS_TRAY_OPEN: + printf ("Tray Open.\n"); + break; + case CDS_DRIVE_NOT_READY: + printf ("Drive Not Ready.\n"); + break; + default: + printf ("This Should not happen!\n"); + break; + } + + for (x_slot=0; x_slot= +2KB on such a disc. For example, it should be possible to do:: + + # dvd+rw-format /dev/hdc (only needed if the disc has never + been formatted) + # mkudffs /dev/hdc + # mount /dev/hdc /cdrom -t udf -o rw,noatime + +However, some drives don't follow the specification and expect the +host to perform aligned writes at 32KB boundaries. Other drives do +follow the specification, but suffer bad performance problems if the +writes are not 32KB aligned. + +Both problems can be solved by using the pktcdvd driver, which always +generates aligned writes:: + + # dvd+rw-format /dev/hdc + # pktsetup dev_name /dev/hdc + # mkudffs /dev/pktcdvd/dev_name + # mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime + + +Packet writing for DVD-RAM media +-------------------------------- + +DVD-RAM discs are random writable, so using the pktcdvd driver is not +necessary. However, using the pktcdvd driver can improve performance +in the same way it does for DVD+RW media. 
+ + +Notes +----- + +- CD-RW media can usually not be overwritten more than about 1000 + times, so to avoid unnecessary wear on the media, you should always + use the noatime mount option. + +- Defect management (ie automatic remapping of bad sectors) has not + been implemented yet, so you are likely to get at least some + filesystem corruption if the disc wears out. + +- Since the pktcdvd driver makes the disc appear as a regular block + device with a 2KB block size, you can put any filesystem you like on + the disc. For example, run:: + + # /sbin/mke2fs /dev/pktcdvd/dev_name + + to create an ext2 filesystem on the disc. + + +Using the pktcdvd sysfs interface +--------------------------------- + +Since Linux 2.6.20, the pktcdvd module has a sysfs interface +and can be controlled by it. For example the "pktcdvd" tool uses +this interface. (see http://tom.ist-im-web.de/download/pktcdvd ) + +"pktcdvd" works similar to "pktsetup", e.g.:: + + # pktcdvd -a dev_name /dev/hdc + # mkudffs /dev/pktcdvd/dev_name + # mount -t udf -o rw,noatime /dev/pktcdvd/dev_name /dvdram + # cp files /dvdram + # umount /dvdram + # pktcdvd -r dev_name + + +For a description of the sysfs interface look into the file: + + Documentation/ABI/testing/sysfs-class-pktcdvd + + +Using the pktcdvd debugfs interface +----------------------------------- + +To read pktcdvd device infos in human readable form, do:: + + # cat /sys/kernel/debug/pktcdvd/pktcdvd[0-7]/info + +For a description of the debugfs interface look into the file: + + Documentation/ABI/testing/debugfs-pktcdvd + + + +Links +----- + +See http://fy.chalmers.se/~appro/linux/DVD+RW/ for more information +about DVD writing. 
diff --git a/Documentation/cdrom/packet-writing.txt b/Documentation/cdrom/packet-writing.txt deleted file mode 100644 index 2834170d821e..000000000000 --- a/Documentation/cdrom/packet-writing.txt +++ /dev/null @@ -1,132 +0,0 @@ -Getting started quick ---------------------- - -- Select packet support in the block device section and UDF support in - the file system section. - -- Compile and install kernel and modules, reboot. - -- You need the udftools package (pktsetup, mkudffs, cdrwtool). - Download from http://sourceforge.net/projects/linux-udf/ - -- Grab a new CD-RW disc and format it (assuming CD-RW is hdc, substitute - as appropriate): - # cdrwtool -d /dev/hdc -q - -- Setup your writer - # pktsetup dev_name /dev/hdc - -- Now you can mount /dev/pktcdvd/dev_name and copy files to it. Enjoy! - # mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime - - -Packet writing for DVD-RW media -------------------------------- - -DVD-RW discs can be written to much like CD-RW discs if they are in -the so called "restricted overwrite" mode. To put a disc in restricted -overwrite mode, run: - - # dvd+rw-format /dev/hdc - -You can then use the disc the same way you would use a CD-RW disc: - - # pktsetup dev_name /dev/hdc - # mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime - - -Packet writing for DVD+RW media -------------------------------- - -According to the DVD+RW specification, a drive supporting DVD+RW discs -shall implement "true random writes with 2KB granularity", which means -that it should be possible to put any filesystem with a block size >= -2KB on such a disc. For example, it should be possible to do: - - # dvd+rw-format /dev/hdc (only needed if the disc has never - been formatted) - # mkudffs /dev/hdc - # mount /dev/hdc /cdrom -t udf -o rw,noatime - -However, some drives don't follow the specification and expect the -host to perform aligned writes at 32KB boundaries. 
Other drives do -follow the specification, but suffer bad performance problems if the -writes are not 32KB aligned. - -Both problems can be solved by using the pktcdvd driver, which always -generates aligned writes. - - # dvd+rw-format /dev/hdc - # pktsetup dev_name /dev/hdc - # mkudffs /dev/pktcdvd/dev_name - # mount /dev/pktcdvd/dev_name /cdrom -t udf -o rw,noatime - - -Packet writing for DVD-RAM media --------------------------------- - -DVD-RAM discs are random writable, so using the pktcdvd driver is not -necessary. However, using the pktcdvd driver can improve performance -in the same way it does for DVD+RW media. - - -Notes ------ - -- CD-RW media can usually not be overwritten more than about 1000 - times, so to avoid unnecessary wear on the media, you should always - use the noatime mount option. - -- Defect management (ie automatic remapping of bad sectors) has not - been implemented yet, so you are likely to get at least some - filesystem corruption if the disc wears out. - -- Since the pktcdvd driver makes the disc appear as a regular block - device with a 2KB block size, you can put any filesystem you like on - the disc. For example, run: - - # /sbin/mke2fs /dev/pktcdvd/dev_name - - to create an ext2 filesystem on the disc. - - -Using the pktcdvd sysfs interface ---------------------------------- - -Since Linux 2.6.20, the pktcdvd module has a sysfs interface -and can be controlled by it. For example the "pktcdvd" tool uses -this interface. 
(see http://tom.ist-im-web.de/download/pktcdvd ) - -"pktcdvd" works similar to "pktsetup", e.g.: - - # pktcdvd -a dev_name /dev/hdc - # mkudffs /dev/pktcdvd/dev_name - # mount -t udf -o rw,noatime /dev/pktcdvd/dev_name /dvdram - # cp files /dvdram - # umount /dvdram - # pktcdvd -r dev_name - - -For a description of the sysfs interface look into the file: - - Documentation/ABI/testing/sysfs-class-pktcdvd - - -Using the pktcdvd debugfs interface ------------------------------------ - -To read pktcdvd device infos in human readable form, do: - - # cat /sys/kernel/debug/pktcdvd/pktcdvd[0-7]/info - -For a description of the debugfs interface look into the file: - - Documentation/ABI/testing/debugfs-pktcdvd - - - -Links ------ - -See http://fy.chalmers.se/~appro/linux/DVD+RW/ for more information -about DVD writing. diff --git a/MAINTAINERS b/MAINTAINERS index 92eb34679b26..c95c29735327 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7610,7 +7610,7 @@ IDE/ATAPI DRIVERS M: Borislav Petkov L: linux-ide@vger.kernel.org S: Maintained -F: Documentation/cdrom/ide-cd +F: Documentation/cdrom/ide-cd.rst F: drivers/ide/ide-cd* IDEAPAD LAPTOP EXTRAS DRIVER diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index 20bb4bfa4be6..96ec7e0fc1ea 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -347,7 +347,7 @@ config CDROM_PKTCDVD is possible. DVD-RW disks must be in restricted overwrite mode. - See the file + See the file for further information on the use of this driver. To compile this driver as a module, choose M here: the diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c index 5d1e0a4a7d84..ac42ae4651ce 100644 --- a/drivers/cdrom/cdrom.c +++ b/drivers/cdrom/cdrom.c @@ -7,7 +7,7 @@ License. See linux/COPYING for more information. Uniform CD-ROM driver for Linux. - See Documentation/cdrom/cdrom-standard.txt for usage information. + See Documentation/cdrom/cdrom-standard.rst for usage information. 
The routines in the file provide a uniform interface between the software that uses CD-ROMs and the various low-level drivers that diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c index 3b15adc6ce98..9d117936bee1 100644 --- a/drivers/ide/ide-cd.c +++ b/drivers/ide/ide-cd.c @@ -9,7 +9,7 @@ * May be copied or modified under the terms of the GNU General Public * License. See linux/COPYING for more information. * - * See Documentation/cdrom/ide-cd for usage information. + * See Documentation/cdrom/ide-cd.rst for usage information. * * Suggestions are welcome. Patches that work are more welcome though. ;-) * -- cgit v1.2.3 From f0ba43774cea3fc14732bb9243ce7238ae8a3202 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 12 Jun 2019 14:52:43 -0300 Subject: docs: convert docs to ReST and rename to *.rst The conversion is actually: - add blank lines and indentation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. 
Signed-off-by: Mauro Carvalho Chehab Acked-by: Bjorn Helgaas Acked-by: Mark Brown Signed-off-by: Jonathan Corbet --- Documentation/device-mapper/cache-policies.rst | 131 +++++++ Documentation/device-mapper/cache-policies.txt | 121 ------ Documentation/device-mapper/cache.rst | 337 +++++++++++++++++ Documentation/device-mapper/cache.txt | 311 ---------------- Documentation/device-mapper/delay.rst | 31 ++ Documentation/device-mapper/delay.txt | 28 -- Documentation/device-mapper/dm-crypt.rst | 173 +++++++++ Documentation/device-mapper/dm-crypt.txt | 162 -------- Documentation/device-mapper/dm-flakey.rst | 74 ++++ Documentation/device-mapper/dm-flakey.txt | 57 --- Documentation/device-mapper/dm-init.rst | 125 +++++++ Documentation/device-mapper/dm-init.txt | 114 ------ Documentation/device-mapper/dm-integrity.rst | 259 +++++++++++++ Documentation/device-mapper/dm-integrity.txt | 233 ------------ Documentation/device-mapper/dm-io.rst | 75 ++++ Documentation/device-mapper/dm-io.txt | 75 ---- Documentation/device-mapper/dm-log.rst | 57 +++ Documentation/device-mapper/dm-log.txt | 54 --- Documentation/device-mapper/dm-queue-length.rst | 48 +++ Documentation/device-mapper/dm-queue-length.txt | 39 -- Documentation/device-mapper/dm-raid.rst | 419 +++++++++++++++++++++ Documentation/device-mapper/dm-raid.txt | 354 ------------------ Documentation/device-mapper/dm-service-time.rst | 101 +++++ Documentation/device-mapper/dm-service-time.txt | 91 ----- Documentation/device-mapper/dm-uevent.rst | 110 ++++++ Documentation/device-mapper/dm-uevent.txt | 97 ----- Documentation/device-mapper/dm-zoned.rst | 146 ++++++++ Documentation/device-mapper/dm-zoned.txt | 144 -------- Documentation/device-mapper/era.rst | 116 ++++++ Documentation/device-mapper/era.txt | 108 ------ Documentation/device-mapper/index.rst | 44 +++ Documentation/device-mapper/kcopyd.rst | 47 +++ Documentation/device-mapper/kcopyd.txt | 47 --- Documentation/device-mapper/linear.rst | 63 ++++ 
Documentation/device-mapper/linear.txt | 61 ---- Documentation/device-mapper/log-writes.rst | 145 ++++++++ Documentation/device-mapper/log-writes.txt | 140 ------- Documentation/device-mapper/persistent-data.rst | 88 +++++ Documentation/device-mapper/persistent-data.txt | 84 ----- Documentation/device-mapper/snapshot.rst | 180 +++++++++ Documentation/device-mapper/snapshot.txt | 176 --------- Documentation/device-mapper/statistics.rst | 225 ++++++++++++ Documentation/device-mapper/statistics.txt | 223 ----------- Documentation/device-mapper/striped.rst | 61 ++++ Documentation/device-mapper/striped.txt | 57 --- Documentation/device-mapper/switch.rst | 141 +++++++ Documentation/device-mapper/switch.txt | 138 ------- Documentation/device-mapper/thin-provisioning.rst | 427 ++++++++++++++++++++++ Documentation/device-mapper/thin-provisioning.txt | 411 --------------------- Documentation/device-mapper/unstriped.rst | 135 +++++++ Documentation/device-mapper/unstriped.txt | 124 ------- Documentation/device-mapper/verity.rst | 229 ++++++++++++ Documentation/device-mapper/verity.txt | 219 ----------- Documentation/device-mapper/writecache.rst | 79 ++++ Documentation/device-mapper/writecache.txt | 70 ---- Documentation/device-mapper/zero.rst | 37 ++ Documentation/device-mapper/zero.txt | 37 -- Documentation/filesystems/ubifs-authentication.md | 4 +- drivers/md/Kconfig | 2 +- drivers/md/dm-init.c | 2 +- drivers/md/dm-raid.c | 2 +- 61 files changed, 4108 insertions(+), 3780 deletions(-) create mode 100644 Documentation/device-mapper/cache-policies.rst delete mode 100644 Documentation/device-mapper/cache-policies.txt create mode 100644 Documentation/device-mapper/cache.rst delete mode 100644 Documentation/device-mapper/cache.txt create mode 100644 Documentation/device-mapper/delay.rst delete mode 100644 Documentation/device-mapper/delay.txt create mode 100644 Documentation/device-mapper/dm-crypt.rst delete mode 100644 Documentation/device-mapper/dm-crypt.txt create mode 100644 
Documentation/device-mapper/dm-flakey.rst delete mode 100644 Documentation/device-mapper/dm-flakey.txt create mode 100644 Documentation/device-mapper/dm-init.rst delete mode 100644 Documentation/device-mapper/dm-init.txt create mode 100644 Documentation/device-mapper/dm-integrity.rst delete mode 100644 Documentation/device-mapper/dm-integrity.txt create mode 100644 Documentation/device-mapper/dm-io.rst delete mode 100644 Documentation/device-mapper/dm-io.txt create mode 100644 Documentation/device-mapper/dm-log.rst delete mode 100644 Documentation/device-mapper/dm-log.txt create mode 100644 Documentation/device-mapper/dm-queue-length.rst delete mode 100644 Documentation/device-mapper/dm-queue-length.txt create mode 100644 Documentation/device-mapper/dm-raid.rst delete mode 100644 Documentation/device-mapper/dm-raid.txt create mode 100644 Documentation/device-mapper/dm-service-time.rst delete mode 100644 Documentation/device-mapper/dm-service-time.txt create mode 100644 Documentation/device-mapper/dm-uevent.rst delete mode 100644 Documentation/device-mapper/dm-uevent.txt create mode 100644 Documentation/device-mapper/dm-zoned.rst delete mode 100644 Documentation/device-mapper/dm-zoned.txt create mode 100644 Documentation/device-mapper/era.rst delete mode 100644 Documentation/device-mapper/era.txt create mode 100644 Documentation/device-mapper/index.rst create mode 100644 Documentation/device-mapper/kcopyd.rst delete mode 100644 Documentation/device-mapper/kcopyd.txt create mode 100644 Documentation/device-mapper/linear.rst delete mode 100644 Documentation/device-mapper/linear.txt create mode 100644 Documentation/device-mapper/log-writes.rst delete mode 100644 Documentation/device-mapper/log-writes.txt create mode 100644 Documentation/device-mapper/persistent-data.rst delete mode 100644 Documentation/device-mapper/persistent-data.txt create mode 100644 Documentation/device-mapper/snapshot.rst delete mode 100644 Documentation/device-mapper/snapshot.txt create mode 
100644 Documentation/device-mapper/statistics.rst delete mode 100644 Documentation/device-mapper/statistics.txt create mode 100644 Documentation/device-mapper/striped.rst delete mode 100644 Documentation/device-mapper/striped.txt create mode 100644 Documentation/device-mapper/switch.rst delete mode 100644 Documentation/device-mapper/switch.txt create mode 100644 Documentation/device-mapper/thin-provisioning.rst delete mode 100644 Documentation/device-mapper/thin-provisioning.txt create mode 100644 Documentation/device-mapper/unstriped.rst delete mode 100644 Documentation/device-mapper/unstriped.txt create mode 100644 Documentation/device-mapper/verity.rst delete mode 100644 Documentation/device-mapper/verity.txt create mode 100644 Documentation/device-mapper/writecache.rst delete mode 100644 Documentation/device-mapper/writecache.txt create mode 100644 Documentation/device-mapper/zero.rst delete mode 100644 Documentation/device-mapper/zero.txt (limited to 'drivers') diff --git a/Documentation/device-mapper/cache-policies.rst b/Documentation/device-mapper/cache-policies.rst new file mode 100644 index 000000000000..b17fe352fc41 --- /dev/null +++ b/Documentation/device-mapper/cache-policies.rst @@ -0,0 +1,131 @@ +============================= +Guidance for writing policies +============================= + +Try to keep transactionality out of it. The core is careful to +avoid asking about anything that is migrating. This is a pain, but +makes it easier to write the policies. + +Mappings are loaded into the policy at construction time. + +Every bio that is mapped by the target is referred to the policy. +The policy can return a simple HIT or MISS or issue a migration. + +Currently there's no way for the policy to issue background work, +e.g. to start writing back dirty blocks that are going to be evicted +soon. + +Because we map bios, rather than requests it's easy for the policy +to get fooled by many small bios. 
For this reason the core target +issues periodic ticks to the policy. It's suggested that the policy +doesn't update states (eg, hit counts) for a block more than once +for each tick. The core ticks by watching bios complete, and so +trying to see when the io scheduler has let the ios run. + + +Overview of supplied cache replacement policies +=============================================== + +multiqueue (mq) +--------------- + +This policy is now an alias for smq (see below). + +The following tunables are accepted, but have no effect:: + + 'sequential_threshold <#nr_sequential_ios>' + 'random_threshold <#nr_random_ios>' + 'read_promote_adjustment <value>' + 'write_promote_adjustment <value>' + 'discard_promote_adjustment <value>' + +Stochastic multiqueue (smq) +--------------------------- + +This policy is the default. + +The stochastic multi-queue (smq) policy addresses some of the problems +with the multiqueue (mq) policy. + +The smq policy (vs mq) offers the promise of less memory utilization, +improved performance and increased adaptability in the face of changing +workloads. smq also does not have any cumbersome tuning knobs. + +Users may switch from "mq" to "smq" simply by appropriately reloading a +DM table that is using the cache target. Doing so will cause all of the +mq policy's hints to be dropped. Also, performance of the cache may +degrade slightly until smq recalculates the origin device's hotspots +that should be cached. + +Memory usage +^^^^^^^^^^^^ + +The mq policy used a lot of memory; 88 bytes per cache block on a 64 +bit machine. + +smq uses 28bit indexes to implement its data structures rather than +pointers. It avoids storing an explicit hit count for each block. It +has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of +the entries (each hotspot block covers a larger area than a single +cache block). + +All this means smq uses ~25bytes per cache block. Still a lot of +memory, but a substantial improvement nontheless.
+ +Level balancing +^^^^^^^^^^^^^^^ + +mq placed entries in different levels of the multiqueue structures +based on their hit count (~ln(hit count)). This meant the bottom +levels generally had the most entries, and the top ones had very +few. Having unbalanced levels like this reduced the efficacy of the +multiqueue. + +smq does not maintain a hit count, instead it swaps hit entries with +the least recently used entry from the level above. The overall +ordering being a side effect of this stochastic process. With this +scheme we can decide how many entries occupy each multiqueue level, +resulting in better promotion/demotion decisions. + +Adaptability: +The mq policy maintained a hit count for each cache block. For a +different block to get promoted to the cache its hit count has to +exceed the lowest currently in the cache. This meant it could take a +long time for the cache to adapt between varying IO patterns. + +smq doesn't maintain hit counts, so a lot of this problem just goes +away. In addition it tracks performance of the hotspot queue, which +is used to decide which blocks to promote. If the hotspot queue is +performing badly then it starts moving entries more quickly between +levels. This lets it adapt to new IO patterns very quickly. + +Performance +^^^^^^^^^^^ + +Testing smq shows substantially better performance than mq. + +cleaner +------- + +The cleaner writes back all dirty blocks in a cache to decommission it. + +Examples +======== + +The syntax for a table is:: + + cache <metadata dev> <cache dev> <origin dev> <block size> + <#feature_args> [<feature arg>]* + <policy> <#policy_args> [<policy arg>]* + +The syntax to send a message using the dmsetup command is:: + + dmsetup message <mapped device> 0 sequential_threshold 1024 + dmsetup message <mapped device> 0 random_threshold 8 + +Using dmsetup:: + + dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \ + /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8" + creates a 128GB large mapped device named 'blah' with the + sequential threshold set to 1024 and the random_threshold set to 8.
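A side note on the dmsetup example that closes cache-policies.rst (this note is not part of the patch): table lengths are given in sectors, so the stated 128GB can be checked from the 268435456-sector length. The following is a minimal sketch, assuming 512-byte sectors; the helper name sectors_to_gib is hypothetical and is not a dmsetup feature.

```shell
#!/bin/sh
# Hypothetical helper (not part of dmsetup): convert a device-mapper
# table length, given in 512-byte sectors, to whole GiB.
sectors_to_gib() {
    echo $(( $1 * 512 / 1024 / 1024 / 1024 ))
}

# Length used in the cache-policies example table.
sectors_to_gib 268435456   # prints 128
```

The same helper applied to the 41943040-sector tables in the cache.rst examples gives 20, i.e. a 20GB mapped device.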
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt deleted file mode 100644 index 86786d87d9a8..000000000000 --- a/Documentation/device-mapper/cache-policies.txt +++ /dev/null @@ -1,121 +0,0 @@ -Guidance for writing policies -============================= - -Try to keep transactionality out of it. The core is careful to -avoid asking about anything that is migrating. This is a pain, but -makes it easier to write the policies. - -Mappings are loaded into the policy at construction time. - -Every bio that is mapped by the target is referred to the policy. -The policy can return a simple HIT or MISS or issue a migration. - -Currently there's no way for the policy to issue background work, -e.g. to start writing back dirty blocks that are going to be evicted -soon. - -Because we map bios, rather than requests it's easy for the policy -to get fooled by many small bios. For this reason the core target -issues periodic ticks to the policy. It's suggested that the policy -doesn't update states (eg, hit counts) for a block more than once -for each tick. The core ticks by watching bios complete, and so -trying to see when the io scheduler has let the ios run. - - -Overview of supplied cache replacement policies -=============================================== - -multiqueue (mq) ---------------- - -This policy is now an alias for smq (see below). - -The following tunables are accepted, but have no effect: - - 'sequential_threshold <#nr_sequential_ios>' - 'random_threshold <#nr_random_ios>' - 'read_promote_adjustment <value>' - 'write_promote_adjustment <value>' - 'discard_promote_adjustment <value>' - -Stochastic multiqueue (smq) ---------------------------- - -This policy is the default. - -The stochastic multi-queue (smq) policy addresses some of the problems -with the multiqueue (mq) policy.
- -The smq policy (vs mq) offers the promise of less memory utilization, -improved performance and increased adaptability in the face of changing -workloads. smq also does not have any cumbersome tuning knobs. - -Users may switch from "mq" to "smq" simply by appropriately reloading a -DM table that is using the cache target. Doing so will cause all of the -mq policy's hints to be dropped. Also, performance of the cache may -degrade slightly until smq recalculates the origin device's hotspots -that should be cached. - -Memory usage: -The mq policy used a lot of memory; 88 bytes per cache block on a 64 -bit machine. - -smq uses 28bit indexes to implement its data structures rather than -pointers. It avoids storing an explicit hit count for each block. It -has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of -the entries (each hotspot block covers a larger area than a single -cache block). - -All this means smq uses ~25bytes per cache block. Still a lot of -memory, but a substantial improvement nontheless. - -Level balancing: -mq placed entries in different levels of the multiqueue structures -based on their hit count (~ln(hit count)). This meant the bottom -levels generally had the most entries, and the top ones had very -few. Having unbalanced levels like this reduced the efficacy of the -multiqueue. - -smq does not maintain a hit count, instead it swaps hit entries with -the least recently used entry from the level above. The overall -ordering being a side effect of this stochastic process. With this -scheme we can decide how many entries occupy each multiqueue level, -resulting in better promotion/demotion decisions. - -Adaptability: -The mq policy maintained a hit count for each cache block. For a -different block to get promoted to the cache its hit count has to -exceed the lowest currently in the cache. This meant it could take a -long time for the cache to adapt between varying IO patterns. 
- -smq doesn't maintain hit counts, so a lot of this problem just goes -away. In addition it tracks performance of the hotspot queue, which -is used to decide which blocks to promote. If the hotspot queue is -performing badly then it starts moving entries more quickly between -levels. This lets it adapt to new IO patterns very quickly. - -Performance: -Testing smq shows substantially better performance than mq. - -cleaner -------- - -The cleaner writes back all dirty blocks in a cache to decommission it. - -Examples -======== - -The syntax for a table is: - cache <metadata dev> <cache dev> <origin dev> <block size> - <#feature_args> [<feature arg>]* - <policy> <#policy_args> [<policy arg>]* - -The syntax to send a message using the dmsetup command is: - dmsetup message <mapped device> 0 sequential_threshold 1024 - dmsetup message <mapped device> 0 random_threshold 8 - -Using dmsetup: - dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \ - /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8" - creates a 128GB large mapped device named 'blah' with the - sequential threshold set to 1024 and the random_threshold set to 8. diff --git a/Documentation/device-mapper/cache.rst b/Documentation/device-mapper/cache.rst new file mode 100644 index 000000000000..f15e5254d05b --- /dev/null +++ b/Documentation/device-mapper/cache.rst @@ -0,0 +1,337 @@ +===== +Cache +===== + +Introduction +============ + +dm-cache is a device mapper target written by Joe Thornber, Heinz +Mauelshagen, and Mike Snitzer. + +It aims to improve performance of a block device (eg, a spindle) by +dynamically migrating some of its data to a faster, smaller device +(eg, an SSD). + +This device-mapper solution allows us to insert this caching at +different levels of the dm stack, for instance above the data device for +a thin-provisioning pool. Caching solutions that are integrated more +closely with the virtual memory system should give better performance. + +The target reuses the metadata library used in the thin-provisioning +library.
+ +The decision as to what data to migrate and when is left to a plug-in +policy module. Several of these have been written as we experiment, +and we hope other people will contribute others for specific io +scenarios (eg. a vm image server). + +Glossary +======== + + Migration + Movement of the primary copy of a logical block from one + device to the other. + Promotion + Migration from slow device to fast device. + Demotion + Migration from fast device to slow device. + +The origin device always contains a copy of the logical block, which +may be out of date or kept in sync with the copy on the cache device +(depending on policy). + +Design +====== + +Sub-devices +----------- + +The target is constructed by passing three devices to it (along with +other parameters detailed later): + +1. An origin device - the big, slow one. + +2. A cache device - the small, fast one. + +3. A small metadata device - records which blocks are in the cache, + which are dirty, and extra hints for use by the policy object. + This information could be put on the cache device, but having it + separate allows the volume manager to configure it differently, + e.g. as a mirror for extra robustness. This metadata device may only + be used by a single cache device. + +Fixed block size +---------------- + +The origin is divided up into blocks of a fixed size. This block size +is configurable when you first create the cache. Typically we've been +using block sizes of 256KB - 1024KB. The block size must be between 64 +sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB). + +Having a fixed block size simplifies the target a lot. But it is +something of a compromise. For instance, a small part of a block may be +getting hit a lot, yet the whole block will be promoted to the cache. +So large block sizes are bad because they waste cache space. And small +block sizes are bad because they increase the amount of metadata (both +in core and on disk). 
+ +Cache operating modes +--------------------- + +The cache has three operating modes: writeback, writethrough and +passthrough. + +If writeback, the default, is selected then a write to a block that is +cached will go only to the cache and the block will be marked dirty in +the metadata. + +If writethrough is selected then a write to a cached block will not +complete until it has hit both the origin and cache devices. Clean +blocks should remain clean. + +If passthrough is selected, useful when the cache contents are not known +to be coherent with the origin device, then all reads are served from +the origin device (all reads miss the cache) and all writes are +forwarded to the origin device; additionally, write hits cause cache +block invalidates. To enable passthrough mode the cache must be clean. +Passthrough mode allows a cache device to be activated without having to +worry about coherency. Coherency that exists is maintained, although +the cache will gradually cool as writes take place. If the coherency of +the cache can later be verified, or established through use of the +"invalidate_cblocks" message, the cache device can be transitioned to +writethrough or writeback mode while still warm. Otherwise, the cache +contents can be discarded prior to transitioning to the desired +operating mode. + +A simple cleaner policy is provided, which will clean (write back) all +dirty blocks in a cache. Useful for decommissioning a cache or when +shrinking a cache. Shrinking the cache's fast device requires all cache +blocks, in the area of the cache being removed, to be clean. If the +area being removed from the cache still contains dirty blocks the resize +will fail. Care must be taken to never reduce the volume used for the +cache's fast device until the cache is clean. This is of particular +importance if writeback mode is used. Writethrough and passthrough +modes already maintain a clean cache. 
Future support to partially clean +the cache, above a specified threshold, will allow for keeping the cache +warm and in writeback mode during resize. + +Migration throttling +-------------------- + +Migrating data between the origin and cache device uses bandwidth. +The user can set a throttle to prevent more than a certain amount of +migration occurring at any one time. Currently we're not taking any +account of normal io traffic going to the devices. More work needs +doing here to avoid migrating during those peak io moments. + +For the time being, a message "migration_threshold <#sectors>" +can be used to set the maximum number of sectors being migrated, +the default being 2048 sectors (1MB). + +Updating on-disk metadata +------------------------- + +On-disk metadata is committed every time a FLUSH or FUA bio is written. +If no such requests are made then commits will occur every second. This +means the cache behaves like a physical disk that has a volatile write +cache. If power is lost you may lose some recent writes. The metadata +should always be consistent in spite of any crash. + +The 'dirty' state for a cache block changes far too frequently for us +to keep updating it on the fly. So we treat it as a hint. In normal +operation it will be written when the dm device is suspended. If the +system crashes all cache blocks will be assumed dirty when restarted. + +Per-block policy hints +---------------------- + +Policy plug-ins can store a chunk of data per cache block. It's up to +the policy how big this chunk is, but it should be kept small. Like the +dirty flags this data is lost if there's a crash so a safe fallback +value should always be possible. + +Policy hints affect performance, not correctness. + +Policy messaging +---------------- + +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. Refer to cache-policies.txt. 
+ +Discard bitset resolution +------------------------- + +We can avoid copying data during migration if we know the block has +been discarded. A prime example of this is when mkfs discards the +whole block device. We store a bitset tracking the discard state of +blocks. However, we allow this bitset to have a different block size +from the cache blocks. This is because we need to track the discard +state for all of the origin device (compare with the dirty bitset +which is just for the smaller cache device). + +Target interface +================ + +Constructor +----------- + + :: + + cache <metadata dev> <cache dev> <origin dev> <block size> + <#feature args> [<feature arg>]* + <policy> <#policy args> [policy args]* + + ================ ======================================================= + metadata dev fast device holding the persistent metadata + cache dev fast device holding cached data blocks + origin dev slow device holding original data blocks + block size cache unit size in sectors + + #feature args number of feature arguments passed + feature args writethrough or passthrough (The default is writeback.) + + policy the replacement policy to use + #policy args an even number of arguments corresponding to + key/value pairs passed to the policy + policy args key/value pairs passed to the policy + E.g. 'sequential_threshold 1024' + See cache-policies.txt for details. + ================ ======================================================= + +Optional feature arguments are: + + + ==================== ======================================================== + writethrough write through caching that prohibits cache block + content from being different from origin block content. + Without this argument, the default behaviour is to write + back cache block contents later for performance reasons, + so they may differ from the corresponding origin blocks. + + passthrough a degraded mode useful for various cache coherency + situations (e.g., rolling back snapshots of + underlying storage). Reads and writes always go to + the origin.
If a write goes to a cached origin + block, then the cache block is invalidated. + To enable passthrough mode the cache must be clean. + + metadata2 use version 2 of the metadata. This stores the dirty + bits in a separate btree, which improves speed of + shutting down the cache. + + no_discard_passdown disable passing down discards from the cache + to the origin's data device. + ==================== ======================================================== + +A policy called 'default' is always registered. This is an alias for +the policy we currently think is giving best all round performance. + +As the default policy could vary between kernels, if you are relying on +the characteristics of a specific policy, always request it by name. + +Status +------ + +:: + + <metadata block size> <#used metadata blocks>/<#total metadata blocks> + <cache block size> <#used cache blocks>/<#total cache blocks> + <#read hits> <#read misses> <#write hits> <#write misses> + <#demotions> <#promotions> <#dirty> <#features> <features>* + <#core args> <core args>* <policy name> <#policy args> <policy args>* + <cache metadata mode> <needs_check> + +========================= ===================================================== +metadata block size Fixed block size for each metadata block in + sectors +#used metadata blocks Number of metadata blocks used +#total metadata blocks Total number of metadata blocks +cache block size Configurable block size for the cache device + in sectors +#used cache blocks Number of blocks resident in the cache +#total cache blocks Total number of cache blocks +#read hits Number of times a READ bio has been mapped + to the cache +#read misses Number of times a READ bio has been mapped + to the origin +#write hits Number of times a WRITE bio has been mapped + to the cache +#write misses Number of times a WRITE bio has been + mapped to the origin +#demotions Number of times a block has been removed + from the cache +#promotions Number of times a block has been moved to + the cache +#dirty Number of blocks in the cache that differ + from the origin +#feature args Number of feature
args to follow +feature args 'writethrough' (optional) +#core args Number of core arguments (must be even) +core args Key/value pairs for tuning the core + e.g. migration_threshold +policy name Name of the policy +#policy args Number of policy arguments to follow (must be even) +policy args Key/value pairs e.g. sequential_threshold +cache metadata mode ro if read-only, rw if read-write + + In serious cases where even a read-only mode is + deemed unsafe no further I/O will be permitted and + the status will just contain the string 'Fail'. + The userspace recovery tools should then be used. +needs_check 'needs_check' if set, '-' if not set + A metadata operation has failed, resulting in the + needs_check flag being set in the metadata's + superblock. The metadata device must be + deactivated and checked/repaired before the + cache can be made fully operational again. + '-' indicates needs_check is not set. +========================= ===================================================== + +Messages +-------- + +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. (A sysfs interface would also be possible.) + +The message format is:: + + <key> <value> + +E.g.:: + + dmsetup message my_cache 0 sequential_threshold 1024 + + +Invalidation is removing an entry from the cache without writing it +back. Cache blocks can be invalidated via the invalidate_cblocks +message, which takes an arbitrary number of cblock ranges. Each cblock +range's end value is "one past the end", meaning 5-10 expresses a range +of values from 5 to 9. Each cblock must be expressed as a decimal +value, in the future a variant message that takes cblock ranges +expressed in hexadecimal may be needed to better support efficient +invalidation of larger caches.
The cache must be in passthrough mode +when invalidate_cblocks is used:: + + invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]* + +E.g.:: + + dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789 + +Examples +======== + +The test suite can be found here: + +https://github.com/jthornber/device-mapper-test-suite + +:: + + dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ + /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0' + dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ + /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ + mq 4 sequential_threshold 1024 random_threshold 8' diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt deleted file mode 100644 index 8ae1cf8e94da..000000000000 --- a/Documentation/device-mapper/cache.txt +++ /dev/null @@ -1,311 +0,0 @@ -Introduction -============ - -dm-cache is a device mapper target written by Joe Thornber, Heinz -Mauelshagen, and Mike Snitzer. - -It aims to improve performance of a block device (eg, a spindle) by -dynamically migrating some of its data to a faster, smaller device -(eg, an SSD). - -This device-mapper solution allows us to insert this caching at -different levels of the dm stack, for instance above the data device for -a thin-provisioning pool. Caching solutions that are integrated more -closely with the virtual memory system should give better performance. - -The target reuses the metadata library used in the thin-provisioning -library. - -The decision as to what data to migrate and when is left to a plug-in -policy module. Several of these have been written as we experiment, -and we hope other people will contribute others for specific io -scenarios (eg. a vm image server). - -Glossary -======== - - Migration - Movement of the primary copy of a logical block from one - device to the other. - Promotion - Migration from slow device to fast device. - Demotion - Migration from fast device to slow device.
- -The origin device always contains a copy of the logical block, which -may be out of date or kept in sync with the copy on the cache device -(depending on policy). - -Design -====== - -Sub-devices ------------ - -The target is constructed by passing three devices to it (along with -other parameters detailed later): - -1. An origin device - the big, slow one. - -2. A cache device - the small, fast one. - -3. A small metadata device - records which blocks are in the cache, - which are dirty, and extra hints for use by the policy object. - This information could be put on the cache device, but having it - separate allows the volume manager to configure it differently, - e.g. as a mirror for extra robustness. This metadata device may only - be used by a single cache device. - -Fixed block size ----------------- - -The origin is divided up into blocks of a fixed size. This block size -is configurable when you first create the cache. Typically we've been -using block sizes of 256KB - 1024KB. The block size must be between 64 -sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB). - -Having a fixed block size simplifies the target a lot. But it is -something of a compromise. For instance, a small part of a block may be -getting hit a lot, yet the whole block will be promoted to the cache. -So large block sizes are bad because they waste cache space. And small -block sizes are bad because they increase the amount of metadata (both -in core and on disk). - -Cache operating modes ---------------------- - -The cache has three operating modes: writeback, writethrough and -passthrough. - -If writeback, the default, is selected then a write to a block that is -cached will go only to the cache and the block will be marked dirty in -the metadata. - -If writethrough is selected then a write to a cached block will not -complete until it has hit both the origin and cache devices. Clean -blocks should remain clean. 
- -If passthrough is selected, useful when the cache contents are not known -to be coherent with the origin device, then all reads are served from -the origin device (all reads miss the cache) and all writes are -forwarded to the origin device; additionally, write hits cause cache -block invalidates. To enable passthrough mode the cache must be clean. -Passthrough mode allows a cache device to be activated without having to -worry about coherency. Coherency that exists is maintained, although -the cache will gradually cool as writes take place. If the coherency of -the cache can later be verified, or established through use of the -"invalidate_cblocks" message, the cache device can be transitioned to -writethrough or writeback mode while still warm. Otherwise, the cache -contents can be discarded prior to transitioning to the desired -operating mode. - -A simple cleaner policy is provided, which will clean (write back) all -dirty blocks in a cache. Useful for decommissioning a cache or when -shrinking a cache. Shrinking the cache's fast device requires all cache -blocks, in the area of the cache being removed, to be clean. If the -area being removed from the cache still contains dirty blocks the resize -will fail. Care must be taken to never reduce the volume used for the -cache's fast device until the cache is clean. This is of particular -importance if writeback mode is used. Writethrough and passthrough -modes already maintain a clean cache. Future support to partially clean -the cache, above a specified threshold, will allow for keeping the cache -warm and in writeback mode during resize. - -Migration throttling --------------------- - -Migrating data between the origin and cache device uses bandwidth. -The user can set a throttle to prevent more than a certain amount of -migration occurring at any one time. Currently we're not taking any -account of normal io traffic going to the devices. 
More work needs -doing here to avoid migrating during those peak io moments. - -For the time being, a message "migration_threshold <#sectors>" -can be used to set the maximum number of sectors being migrated, -the default being 2048 sectors (1MB). - -Updating on-disk metadata -------------------------- - -On-disk metadata is committed every time a FLUSH or FUA bio is written. -If no such requests are made then commits will occur every second. This -means the cache behaves like a physical disk that has a volatile write -cache. If power is lost you may lose some recent writes. The metadata -should always be consistent in spite of any crash. - -The 'dirty' state for a cache block changes far too frequently for us -to keep updating it on the fly. So we treat it as a hint. In normal -operation it will be written when the dm device is suspended. If the -system crashes all cache blocks will be assumed dirty when restarted. - -Per-block policy hints ----------------------- - -Policy plug-ins can store a chunk of data per cache block. It's up to -the policy how big this chunk is, but it should be kept small. Like the -dirty flags this data is lost if there's a crash so a safe fallback -value should always be possible. - -Policy hints affect performance, not correctness. - -Policy messaging ----------------- - -Policies will have different tunables, specific to each one, so we -need a generic way of getting and setting these. Device-mapper -messages are used. Refer to cache-policies.txt. - -Discard bitset resolution -------------------------- - -We can avoid copying data during migration if we know the block has -been discarded. A prime example of this is when mkfs discards the -whole block device. We store a bitset tracking the discard state of -blocks. However, we allow this bitset to have a different block size -from the cache blocks. 
This is because we need to track the discard -state for all of the origin device (compare with the dirty bitset -which is just for the smaller cache device). - -Target interface -================ - -Constructor ------------ - - cache <metadata dev> <cache dev> <origin dev> <block size> - <#feature args> [<feature arg>]* - <policy> <#policy args> [policy args]* - - metadata dev : fast device holding the persistent metadata - cache dev : fast device holding cached data blocks - origin dev : slow device holding original data blocks - block size : cache unit size in sectors - - #feature args : number of feature arguments passed - feature args : writethrough or passthrough (The default is writeback.) - - policy : the replacement policy to use - #policy args : an even number of arguments corresponding to - key/value pairs passed to the policy - policy args : key/value pairs passed to the policy - E.g. 'sequential_threshold 1024' - See cache-policies.txt for details. - -Optional feature arguments are: - writethrough : write through caching that prohibits cache block - content from being different from origin block content. - Without this argument, the default behaviour is to write - back cache block contents later for performance reasons, - so they may differ from the corresponding origin blocks. - - passthrough : a degraded mode useful for various cache coherency - situations (e.g., rolling back snapshots of - underlying storage). Reads and writes always go to - the origin. If a write goes to a cached origin - block, then the cache block is invalidated. - To enable passthrough mode the cache must be clean. - - metadata2 : use version 2 of the metadata. This stores the dirty bits - in a separate btree, which improves speed of shutting - down the cache. - - no_discard_passdown : disable passing down discards from the cache - to the origin's data device. - -A policy called 'default' is always registered. This is an alias for -the policy we currently think is giving best all round performance.
- -As the default policy could vary between kernels, if you are relying on -the characteristics of a specific policy, always request it by name. - -Status ------- - -<metadata block size> <#used metadata blocks>/<#total metadata blocks> -<cache block size> <#used cache blocks>/<#total cache blocks> -<#read hits> <#read misses> <#write hits> <#write misses> -<#demotions> <#promotions> <#dirty> <#features> <features>* -<#core args> <core args>* <policy name> <#policy args> <policy args>* -<cache metadata mode> <needs_check> - -metadata block size : Fixed block size for each metadata block in - sectors -#used metadata blocks : Number of metadata blocks used -#total metadata blocks : Total number of metadata blocks -cache block size : Configurable block size for the cache device - in sectors -#used cache blocks : Number of blocks resident in the cache -#total cache blocks : Total number of cache blocks -#read hits : Number of times a READ bio has been mapped - to the cache -#read misses : Number of times a READ bio has been mapped - to the origin -#write hits : Number of times a WRITE bio has been mapped - to the cache -#write misses : Number of times a WRITE bio has been - mapped to the origin -#demotions : Number of times a block has been removed - from the cache -#promotions : Number of times a block has been moved to - the cache -#dirty : Number of blocks in the cache that differ - from the origin -#feature args : Number of feature args to follow -feature args : 'writethrough' (optional) -#core args : Number of core arguments (must be even) -core args : Key/value pairs for tuning the core - e.g. migration_threshold -policy name : Name of the policy -#policy args : Number of policy arguments to follow (must be even) -policy args : Key/value pairs e.g. sequential_threshold -cache metadata mode : ro if read-only, rw if read-write - In serious cases where even a read-only mode is deemed unsafe - no further I/O will be permitted and the status will just - contain the string 'Fail'. The userspace recovery tools - should then be used.
-needs_check : 'needs_check' if set, '-' if not set - A metadata operation has failed, resulting in the needs_check - flag being set in the metadata's superblock. The metadata - device must be deactivated and checked/repaired before the - cache can be made fully operational again. '-' indicates - needs_check is not set. - -Messages --------- - -Policies will have different tunables, specific to each one, so we -need a generic way of getting and setting these. Device-mapper -messages are used. (A sysfs interface would also be possible.) - -The message format is: - - - -E.g. - dmsetup message my_cache 0 sequential_threshold 1024 - - -Invalidation is removing an entry from the cache without writing it -back. Cache blocks can be invalidated via the invalidate_cblocks -message, which takes an arbitrary number of cblock ranges. Each cblock -range's end value is "one past the end", meaning 5-10 expresses a range -of values from 5 to 9. Each cblock must be expressed as a decimal -value, in the future a variant message that takes cblock ranges -expressed in hexadecimal may be needed to better support efficient -invalidation of larger caches. The cache must be in passthrough mode -when invalidate_cblocks is used. - - invalidate_cblocks [|-]* - -E.g. 
- dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789 - -Examples -======== - -The test suite can be found here: - -https://github.com/jthornber/device-mapper-test-suite - -dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ - /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0' -dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ - /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ - mq 4 sequential_threshold 1024 random_threshold 8' diff --git a/Documentation/device-mapper/delay.rst b/Documentation/device-mapper/delay.rst new file mode 100644 index 000000000000..917ba8c33359 --- /dev/null +++ b/Documentation/device-mapper/delay.rst @@ -0,0 +1,31 @@ +======== +dm-delay +======== + +Device-Mapper's "delay" target delays reads and/or writes +and maps them to different devices. + +Parameters:: + + [ + [ ]] + +With separate write parameters, the first set is only used for reads. +Offsets are specified in sectors. +Delays are specified in milliseconds. + +Example scripts +=============== + +:: + + #!/bin/sh + # Create device delaying rw operation for 500ms + echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed + +:: + + #!/bin/sh + # Create device delaying only write operation for 500ms and + # splitting reads and writes to different devices $1 $2 + echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed diff --git a/Documentation/device-mapper/delay.txt b/Documentation/device-mapper/delay.txt deleted file mode 100644 index 6426c45273cb..000000000000 --- a/Documentation/device-mapper/delay.txt +++ /dev/null @@ -1,28 +0,0 @@ -dm-delay -======== - -Device-Mapper's "delay" target delays reads and/or writes -and maps them to different devices. - -Parameters: - [ - [ ]] - -With separate write parameters, the first set is only used for reads. -Offsets are specified in sectors. -Delays are specified in milliseconds. 
- -Example scripts -=============== -[[ -#!/bin/sh -# Create device delaying rw operation for 500ms -echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed -]] - -[[ -#!/bin/sh -# Create device delaying only write operation for 500ms and -# splitting reads and writes to different devices $1 $2 -echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed -]] diff --git a/Documentation/device-mapper/dm-crypt.rst b/Documentation/device-mapper/dm-crypt.rst new file mode 100644 index 000000000000..8f4a3f889d43 --- /dev/null +++ b/Documentation/device-mapper/dm-crypt.rst @@ -0,0 +1,173 @@ +======== +dm-crypt +======== + +Device-Mapper's "crypt" target provides transparent encryption of block devices +using the kernel crypto API. + +For a more detailed description of supported parameters see: +https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt + +Parameters:: + + \ + [<#opt_params> ] + + + Encryption cipher, encryption mode and Initial Vector (IV) generator. + + The cipher specifications format is:: + + cipher[:keycount]-chainmode-ivmode[:ivopts] + + Examples:: + + aes-cbc-essiv:sha256 + aes-xts-plain64 + serpent-xts-plain64 + + Cipher format also supports direct specification with kernel crypt API + format (selected by capi: prefix). The IV specification is the same + as for the first format type. + This format is mainly used for specification of authenticated modes. + + The crypto API cipher specifications format is:: + + capi:cipher_api_spec-ivmode[:ivopts] + + Examples:: + + capi:cbc(aes)-essiv:sha256 + capi:xts(aes)-plain64 + + Examples of authenticated modes:: + + capi:gcm(aes)-random + capi:authenc(hmac(sha256),xts(aes))-random + capi:rfc7539(chacha20,poly1305)-random + + The /proc/crypto contains a list of curently loaded crypto modes. + + + Key used for encryption. It is encoded either as a hexadecimal number + or it can be passed as prefixed with single colon + character (':') for keys residing in kernel keyring service. 
+ You can only use key sizes that are valid for the selected cipher + in combination with the selected iv mode. + Note that for some iv modes the key string can contain additional + keys (for example IV seed) so the key contains more parts concatenated + into a single string. + + + The kernel keyring key is identified by string in following format: + ::. + + + The encryption key size in bytes. The kernel key payload size must match + the value passed in . + + + Either 'logon' or 'user' kernel key type. + + + The kernel keyring key description crypt target should look for + when loading key of . + + + Multi-key compatibility mode. You can define keys and + then sectors are encrypted according to their offsets (sector 0 uses key0; + sector 1 uses key1 etc.). must be a power of two. + + + The IV offset is a sector count that is added to the sector number + before creating the IV. + + + This is the device that is going to be used as backend and contains the + encrypted data. You can specify it as a path like /dev/xxx or a device + number :. + + + Starting sector within the device where the encrypted data begins. + +<#opt_params> + Number of optional parameters. If there are no optional parameters, + the optional paramaters section can be skipped or #opt_params can be zero. + Otherwise #opt_params is the number of following arguments. + + Example of optional parameters section: + 3 allow_discards same_cpu_crypt submit_from_crypt_cpus + +allow_discards + Block discard requests (a.k.a. TRIM) are passed through the crypt device. + The default is to ignore discard requests. + + WARNING: Assess the specific security risks carefully before enabling this + option. For example, allowing discards on encrypted devices may lead to + the leak of information about the ciphertext device (filesystem type, + used space etc.) if the discarded blocks can be located easily on the + device later. + +same_cpu_crypt + Perform encryption using the same cpu that IO was submitted on. 
+ The default is to use an unbound workqueue so that encryption work + is automatically balanced between available CPUs. + +submit_from_crypt_cpus + Disable offloading writes to a separate thread after encryption. + There are some situations where offloading write bios from the + encryption threads to a single thread degrades performance + significantly. The default is to offload write bios to the same + thread because it benefits CFQ to have writes submitted using the + same context. + +integrity:: + The device requires additional metadata per-sector stored + in per-bio integrity structure. This metadata must by provided + by underlying dm-integrity target. + + The can be "none" if metadata is used only for persistent IV. + + For Authenticated Encryption with Additional Data (AEAD) + the is "aead". An AEAD mode additionally calculates and verifies + integrity for the encrypted device. The additional space is then + used for storing authentication tag (and persistent IV if needed). + +sector_size: + Use as the encryption unit instead of 512 bytes sectors. + This option can be in range 512 - 4096 bytes and must be power of two. + Virtual device will announce this size as a minimal IO and logical sector. + +iv_large_sectors + IV generators will use sector number counted in units + instead of default 512 bytes sectors. + + For example, if is 4096 bytes, plain64 IV for the second + sector will be 8 (without flag) and 1 if iv_large_sectors is present. + The must be multiple of (in 512 bytes units) + if this flag is specified. 
+ +Example scripts +=============== +LUKS (Linux Unified Key Setup) is now the preferred way to set up disk +encryption with dm-crypt using the 'cryptsetup' utility, see +https://gitlab.com/cryptsetup/cryptsetup + +:: + + #!/bin/sh + # Create a crypt device using dmsetup + dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0" + +:: + + #!/bin/sh + # Create a crypt device using dmsetup when encryption key is stored in keyring service + dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0" + +:: + + #!/bin/sh + # Create a crypt device using cryptsetup and LUKS header with default cipher + cryptsetup luksFormat $1 + cryptsetup luksOpen $1 crypt1 diff --git a/Documentation/device-mapper/dm-crypt.txt b/Documentation/device-mapper/dm-crypt.txt deleted file mode 100644 index 3b3e1de21c9c..000000000000 --- a/Documentation/device-mapper/dm-crypt.txt +++ /dev/null @@ -1,162 +0,0 @@ -dm-crypt -========= - -Device-Mapper's "crypt" target provides transparent encryption of block devices -using the kernel crypto API. - -For a more detailed description of supported parameters see: -https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt - -Parameters: \ - [<#opt_params> ] - - - Encryption cipher, encryption mode and Initial Vector (IV) generator. - - The cipher specifications format is: - cipher[:keycount]-chainmode-ivmode[:ivopts] - Examples: - aes-cbc-essiv:sha256 - aes-xts-plain64 - serpent-xts-plain64 - - Cipher format also supports direct specification with kernel crypt API - format (selected by capi: prefix). The IV specification is the same - as for the first format type. - This format is mainly used for specification of authenticated modes. 
- - The crypto API cipher specifications format is: - capi:cipher_api_spec-ivmode[:ivopts] - Examples: - capi:cbc(aes)-essiv:sha256 - capi:xts(aes)-plain64 - Examples of authenticated modes: - capi:gcm(aes)-random - capi:authenc(hmac(sha256),xts(aes))-random - capi:rfc7539(chacha20,poly1305)-random - - The /proc/crypto contains a list of curently loaded crypto modes. - - - Key used for encryption. It is encoded either as a hexadecimal number - or it can be passed as prefixed with single colon - character (':') for keys residing in kernel keyring service. - You can only use key sizes that are valid for the selected cipher - in combination with the selected iv mode. - Note that for some iv modes the key string can contain additional - keys (for example IV seed) so the key contains more parts concatenated - into a single string. - - - The kernel keyring key is identified by string in following format: - ::. - - - The encryption key size in bytes. The kernel key payload size must match - the value passed in . - - - Either 'logon' or 'user' kernel key type. - - - The kernel keyring key description crypt target should look for - when loading key of . - - - Multi-key compatibility mode. You can define keys and - then sectors are encrypted according to their offsets (sector 0 uses key0; - sector 1 uses key1 etc.). must be a power of two. - - - The IV offset is a sector count that is added to the sector number - before creating the IV. - - - This is the device that is going to be used as backend and contains the - encrypted data. You can specify it as a path like /dev/xxx or a device - number :. - - - Starting sector within the device where the encrypted data begins. - -<#opt_params> - Number of optional parameters. If there are no optional parameters, - the optional paramaters section can be skipped or #opt_params can be zero. - Otherwise #opt_params is the number of following arguments. 
- - Example of optional parameters section: - 3 allow_discards same_cpu_crypt submit_from_crypt_cpus - -allow_discards - Block discard requests (a.k.a. TRIM) are passed through the crypt device. - The default is to ignore discard requests. - - WARNING: Assess the specific security risks carefully before enabling this - option. For example, allowing discards on encrypted devices may lead to - the leak of information about the ciphertext device (filesystem type, - used space etc.) if the discarded blocks can be located easily on the - device later. - -same_cpu_crypt - Perform encryption using the same cpu that IO was submitted on. - The default is to use an unbound workqueue so that encryption work - is automatically balanced between available CPUs. - -submit_from_crypt_cpus - Disable offloading writes to a separate thread after encryption. - There are some situations where offloading write bios from the - encryption threads to a single thread degrades performance - significantly. The default is to offload write bios to the same - thread because it benefits CFQ to have writes submitted using the - same context. - -integrity:: - The device requires additional metadata per-sector stored - in per-bio integrity structure. This metadata must by provided - by underlying dm-integrity target. - - The can be "none" if metadata is used only for persistent IV. - - For Authenticated Encryption with Additional Data (AEAD) - the is "aead". An AEAD mode additionally calculates and verifies - integrity for the encrypted device. The additional space is then - used for storing authentication tag (and persistent IV if needed). - -sector_size: - Use as the encryption unit instead of 512 bytes sectors. - This option can be in range 512 - 4096 bytes and must be power of two. - Virtual device will announce this size as a minimal IO and logical sector. - -iv_large_sectors - IV generators will use sector number counted in units - instead of default 512 bytes sectors. 
- - For example, if is 4096 bytes, plain64 IV for the second - sector will be 8 (without flag) and 1 if iv_large_sectors is present. - The must be multiple of (in 512 bytes units) - if this flag is specified. - -Example scripts -=============== -LUKS (Linux Unified Key Setup) is now the preferred way to set up disk -encryption with dm-crypt using the 'cryptsetup' utility, see -https://gitlab.com/cryptsetup/cryptsetup - -[[ -#!/bin/sh -# Create a crypt device using dmsetup -dmsetup create crypt1 --table "0 `blockdev --getsz $1` crypt aes-cbc-essiv:sha256 babebabebabebabebabebabebabebabe 0 $1 0" -]] - -[[ -#!/bin/sh -# Create a crypt device using dmsetup when encryption key is stored in keyring service -dmsetup create crypt2 --table "0 `blockdev --getsize $1` crypt aes-cbc-essiv:sha256 :32:logon:my_prefix:my_key 0 $1 0" -]] - -[[ -#!/bin/sh -# Create a crypt device using cryptsetup and LUKS header with default cipher -cryptsetup luksFormat $1 -cryptsetup luksOpen $1 crypt1 -]] diff --git a/Documentation/device-mapper/dm-flakey.rst b/Documentation/device-mapper/dm-flakey.rst new file mode 100644 index 000000000000..86138735879d --- /dev/null +++ b/Documentation/device-mapper/dm-flakey.rst @@ -0,0 +1,74 @@ +========= +dm-flakey +========= + +This target is the same as the linear target except that it exhibits +unreliable behaviour periodically. It's been found useful in simulating +failing devices for testing purposes. + +Starting from the time the table is loaded, the device is available for + seconds, then exhibits unreliable behaviour for seconds, and then this cycle repeats. + +Also, consider using this in combination with the dm-delay target too, +which can delay reads and writes and/or send them to different +underlying devices. + +Table parameters +---------------- + +:: + + \ + [ []] + +Mandatory parameters: + + : + Full pathname to the underlying block-device, or a + "major:minor" device-number. + : + Starting sector within the device. 
+ : + Number of seconds device is available. + : + Number of seconds device returns errors. + +Optional feature parameters: + + If no feature parameters are present, during the periods of + unreliability, all I/O returns errors. + + drop_writes: + All write I/O is silently ignored. + Read I/O is handled correctly. + + error_writes: + All write I/O is failed with an error signalled. + Read I/O is handled correctly. + + corrupt_bio_byte : + During , replace of the data of + each matching bio with . + + : + The offset of the byte to replace. + Counting starts at 1, to replace the first byte. + : + Either 'r' to corrupt reads or 'w' to corrupt writes. + 'w' is incompatible with drop_writes. + : + The value (from 0-255) to write. + : + Perform the replacement only if bio->bi_opf has all the + selected flags set. + +Examples: + +Replaces the 32nd byte of READ bios with the value 1:: + + corrupt_bio_byte 32 r 1 0 + +Replaces the 224th byte of REQ_META (=32) bios with the value 0:: + + corrupt_bio_byte 224 w 0 32 diff --git a/Documentation/device-mapper/dm-flakey.txt b/Documentation/device-mapper/dm-flakey.txt deleted file mode 100644 index 9f0e247d0877..000000000000 --- a/Documentation/device-mapper/dm-flakey.txt +++ /dev/null @@ -1,57 +0,0 @@ -dm-flakey -========= - -This target is the same as the linear target except that it exhibits -unreliable behaviour periodically. It's been found useful in simulating -failing devices for testing purposes. - -Starting from the time the table is loaded, the device is available for - seconds, then exhibits unreliable behaviour for seconds, and then this cycle repeats. - -Also, consider using this in combination with the dm-delay target too, -which can delay reads and writes and/or send them to different -underlying devices. - -Table parameters ----------------- - \ - [ []] - -Mandatory parameters: - : Full pathname to the underlying block-device, or a - "major:minor" device-number. - : Starting sector within the device. 
- : Number of seconds device is available. - : Number of seconds device returns errors. - -Optional feature parameters: - If no feature parameters are present, during the periods of - unreliability, all I/O returns errors. - - drop_writes: - All write I/O is silently ignored. - Read I/O is handled correctly. - - error_writes: - All write I/O is failed with an error signalled. - Read I/O is handled correctly. - - corrupt_bio_byte : - During , replace of the data of - each matching bio with . - - : The offset of the byte to replace. - Counting starts at 1, to replace the first byte. - : Either 'r' to corrupt reads or 'w' to corrupt writes. - 'w' is incompatible with drop_writes. - : The value (from 0-255) to write. - : Perform the replacement only if bio->bi_opf has all the - selected flags set. - -Examples: - corrupt_bio_byte 32 r 1 0 - - replaces the 32nd byte of READ bios with the value 1 - - corrupt_bio_byte 224 w 0 32 - - replaces the 224th byte of REQ_META (=32) bios with the value 0 diff --git a/Documentation/device-mapper/dm-init.rst b/Documentation/device-mapper/dm-init.rst new file mode 100644 index 000000000000..e5242ff17e9b --- /dev/null +++ b/Documentation/device-mapper/dm-init.rst @@ -0,0 +1,125 @@ +================================ +Early creation of mapped devices +================================ + +It is possible to configure a device-mapper device to act as the root device for +your system in two ways. + +The first is to build an initial ramdisk which boots to a minimal userspace +which configures the device, then pivot_root(8) in to it. + +The second is to create one or more device-mappers using the module parameter +"dm-mod.create=" through the kernel boot command line argument. + +The format is specified as a string of data separated by commas and optionally +semi-colons, where: + + - a comma is used to separate fields like name, uuid, flags and table + (specifies one device) + - a semi-colon is used to separate devices. 
+
+So the format will look like this::
+
+    dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
+
+Where::
+
+    <name>          ::= The device name.
+    <uuid>          ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
+    <minor>         ::= The device minor number | ""
+    <flags>         ::= "ro" | "rw"
+    <table>         ::= <start_sector> <num_sectors> <target_type> <target_args>
+    <target_type>   ::= "verity" | "linear" | ... (see list below)
+
+The dm line should be equivalent to the one used by the dmsetup tool with the
+`--concise` argument.
+
+Target types
+============
+
+Not all target types are available as there are serious risks in allowing
+activation of certain DM targets without first using userspace tools to check
+the validity of associated metadata.
+
+======================= =======================================================
+`cache`                 constrained, userspace should verify cache device
+`crypt`                 allowed
+`delay`                 allowed
+`era`                   constrained, userspace should verify metadata device
+`flakey`                constrained, meant for test
+`linear`                allowed
+`log-writes`            constrained, userspace should verify metadata device
+`mirror`                constrained, userspace should verify main/mirror device
+`raid`                  constrained, userspace should verify metadata device
+`snapshot`              constrained, userspace should verify src/dst device
+`snapshot-origin`       allowed
+`snapshot-merge`        constrained, userspace should verify src/dst device
+`striped`               allowed
+`switch`                constrained, userspace should verify dev path
+`thin`                  constrained, requires dm target message from userspace
+`thin-pool`             constrained, requires dm target message from userspace
+`verity`                allowed
+`writecache`            constrained, userspace should verify cache device
+`zero`                  constrained, not meant for rootfs
+======================= =======================================================
+
+If the target is not listed above, it is constrained by default (not tested).
+
+Examples
+========
+An example of booting to a linear array made up of user-mode linux block
+devices::
+
+  dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
+
+This will boot to a rw dm-linear target of 8192 sectors split across two block
+devices identified by their major:minor numbers. After boot, udev will rename
+this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
+ +An example of multiple device-mappers, with the dm-mod.create="..." contents +is shown here split on multiple lines for readability:: + + dm-linear,,1,rw, + 0 32768 linear 8:1 0, + 32768 1024000 linear 8:2 0; + dm-verity,,3,ro, + 0 1638400 verity 1 /dev/sdc1 /dev/sdc2 4096 4096 204800 1 sha256 + ac87db56303c9c1da433d7209b5a6ef3e4779df141200cbd7c157dcb8dd89c42 + 5ebfe87f7df3235b80a117ebc4078e44f55045487ad4a96581d1adb564615b51 + +Other examples (per target): + +"crypt":: + + dm-crypt,,8,ro, + 0 1048576 crypt aes-xts-plain64 + babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0 + /dev/sda 0 1 allow_discards + +"delay":: + + dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500 + +"linear":: + + dm-linear,,,rw, + 0 32768 linear /dev/sda1 0, + 32768 1024000 linear /dev/sda2 0, + 1056768 204800 linear /dev/sda3 0, + 1261568 512000 linear /dev/sda4 0 + +"snapshot-origin":: + + dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2 + +"striped":: + + dm-striped,,4,ro,0 1638400 striped 4 4096 + /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0 + +"verity":: + + dm-verity,,4,ro, + 0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256 + fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd + 51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584 diff --git a/Documentation/device-mapper/dm-init.txt b/Documentation/device-mapper/dm-init.txt deleted file mode 100644 index 130b3c3679c5..000000000000 --- a/Documentation/device-mapper/dm-init.txt +++ /dev/null @@ -1,114 +0,0 @@ -Early creation of mapped devices -==================================== - -It is possible to configure a device-mapper device to act as the root device for -your system in two ways. - -The first is to build an initial ramdisk which boots to a minimal userspace -which configures the device, then pivot_root(8) in to it. - -The second is to create one or more device-mappers using the module parameter -"dm-mod.create=" through the kernel boot command line argument. 
-
-The format is specified as a string of data separated by commas and optionally
-semi-colons, where:
- - a comma is used to separate fields like name, uuid, flags and table
-   (specifies one device)
- - a semi-colon is used to separate devices.
-
-So the format will look like this:
-
- dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
-
-Where,
-    <name>          ::= The device name.
-    <uuid>          ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
-    <minor>         ::= The device minor number | ""
-    <flags>         ::= "ro" | "rw"
-    <table>         ::= <start_sector> <num_sectors> <target_type> <target_args>
-    <target_type>   ::= "verity" | "linear" | ... (see list below)
-
-The dm line should be equivalent to the one used by the dmsetup tool with the
---concise argument.
-
-Target types
-============
-
-Not all target types are available as there are serious risks in allowing
-activation of certain DM targets without first using userspace tools to check
-the validity of associated metadata.
-
-    "cache": constrained, userspace should verify cache device
-    "crypt": allowed
-    "delay": allowed
-    "era": constrained, userspace should verify metadata device
-    "flakey": constrained, meant for test
-    "linear": allowed
-    "log-writes": constrained, userspace should verify metadata device
-    "mirror": constrained, userspace should verify main/mirror device
-    "raid": constrained, userspace should verify metadata device
-    "snapshot": constrained, userspace should verify src/dst device
-    "snapshot-origin": allowed
-    "snapshot-merge": constrained, userspace should verify src/dst device
-    "striped": allowed
-    "switch": constrained, userspace should verify dev path
-    "thin": constrained, requires dm target message from userspace
-    "thin-pool": constrained, requires dm target message from userspace
-    "verity": allowed
-    "writecache": constrained, userspace should verify cache device
-    "zero": constrained, not meant for rootfs
-
-If the target is not listed above, it is constrained by default (not tested).
-
-Examples
-========
-An example of booting to a linear array made up of user-mode linux block
-devices:
-
-  dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
-
-This will boot to a rw dm-linear target of 8192 sectors split across two block
-devices identified by their major:minor numbers. After boot, udev will rename
-this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
-
-An example of multiple device-mappers, with the dm-mod.create="..."
contents is shown here -split on multiple lines for readability: - - dm-linear,,1,rw, - 0 32768 linear 8:1 0, - 32768 1024000 linear 8:2 0; - dm-verity,,3,ro, - 0 1638400 verity 1 /dev/sdc1 /dev/sdc2 4096 4096 204800 1 sha256 - ac87db56303c9c1da433d7209b5a6ef3e4779df141200cbd7c157dcb8dd89c42 - 5ebfe87f7df3235b80a117ebc4078e44f55045487ad4a96581d1adb564615b51 - -Other examples (per target): - -"crypt": - dm-crypt,,8,ro, - 0 1048576 crypt aes-xts-plain64 - babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0 - /dev/sda 0 1 allow_discards - -"delay": - dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500 - -"linear": - dm-linear,,,rw, - 0 32768 linear /dev/sda1 0, - 32768 1024000 linear /dev/sda2 0, - 1056768 204800 linear /dev/sda3 0, - 1261568 512000 linear /dev/sda4 0 - -"snapshot-origin": - dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2 - -"striped": - dm-striped,,4,ro,0 1638400 striped 4 4096 - /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0 - -"verity": - dm-verity,,4,ro, - 0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256 - fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd - 51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584 diff --git a/Documentation/device-mapper/dm-integrity.rst b/Documentation/device-mapper/dm-integrity.rst new file mode 100644 index 000000000000..a30aa91b5fbe --- /dev/null +++ b/Documentation/device-mapper/dm-integrity.rst @@ -0,0 +1,259 @@ +============ +dm-integrity +============ + +The dm-integrity target emulates a block device that has additional +per-sector tags that can be used for storing integrity information. + +A general problem with storing integrity tags with every sector is that +writing the sector and the integrity tag must be atomic - i.e. in case of +crash, either both sector and integrity tag or none of them is written. 
+ +To guarantee write atomicity, the dm-integrity target uses journal, it +writes sector data and integrity tags into a journal, commits the journal +and then copies the data and integrity tags to their respective location. + +The dm-integrity target can be used with the dm-crypt target - in this +situation the dm-crypt target creates the integrity data and passes them +to the dm-integrity target via bio_integrity_payload attached to the bio. +In this mode, the dm-crypt and dm-integrity targets provide authenticated +disk encryption - if the attacker modifies the encrypted device, an I/O +error is returned instead of random data. + +The dm-integrity target can also be used as a standalone target, in this +mode it calculates and verifies the integrity tag internally. In this +mode, the dm-integrity target can be used to detect silent data +corruption on the disk or in the I/O path. + +There's an alternate mode of operation where dm-integrity uses bitmap +instead of a journal. If a bit in the bitmap is 1, the corresponding +region's data and integrity tags are not synchronized - if the machine +crashes, the unsynchronized regions will be recalculated. The bitmap mode +is faster than the journal mode, because we don't have to write the data +twice, but it is also less reliable, because if data corruption happens +when the machine crashes, it may not be detected. + +When loading the target for the first time, the kernel driver will format +the device. But it will only format the device if the superblock contains +zeroes. If the superblock is neither valid nor zeroed, the dm-integrity +target can't be loaded. + +To use the target for the first time: + +1. overwrite the superblock with zeroes +2. load the dm-integrity target with one-sector size, the kernel driver + will format the device +3. unload the dm-integrity target +4. read the "provided_data_sectors" value from the superblock +5. load the dm-integrity target with the the target size + "provided_data_sectors" +6. 
if you want to use dm-integrity with dm-crypt, load the dm-crypt target + with the size "provided_data_sectors" + + +Target arguments: + +1. the underlying block device + +2. the number of reserved sector at the beginning of the device - the + dm-integrity won't read of write these sectors + +3. the size of the integrity tag (if "-" is used, the size is taken from + the internal-hash algorithm) + +4. mode: + + D - direct writes (without journal) + in this mode, journaling is + not used and data sectors and integrity tags are written + separately. In case of crash, it is possible that the data + and integrity tag doesn't match. + J - journaled writes + data and integrity tags are written to the + journal and atomicity is guaranteed. In case of crash, + either both data and tag or none of them are written. The + journaled mode degrades write throughput twice because the + data have to be written twice. + B - bitmap mode - data and metadata are written without any + synchronization, the driver maintains a bitmap of dirty + regions where data and metadata don't match. This mode can + only be used with internal hash. + R - recovery mode - in this mode, journal is not replayed, + checksums are not checked and writes to the device are not + allowed. This mode is useful for data recovery if the + device cannot be activated in any of the other standard + modes. + +5. the number of additional arguments + +Additional arguments: + +journal_sectors:number + The size of journal, this argument is used only if formatting the + device. If the device is already formatted, the value from the + superblock is used. + +interleave_sectors:number + The number of interleaved sectors. This values is rounded down to + a power of two. If the device is already formatted, the value from + the superblock is used. + +meta_device:device + Don't interleave the data and metadata on on device. Use a + separate device for metadata. + +buffer_sectors:number + The number of sectors in one buffer. 
The value is rounded down to + a power of two. + + The tag area is accessed using buffers, the buffer size is + configurable. A large buffer size means that the I/O size will + be larger, but there could be fewer I/Os issued. + +journal_watermark:number + The journal watermark in percents. When the size of the journal + exceeds this watermark, the thread that flushes the journal will + be started. + +commit_time:number + Commit time in milliseconds. When this time passes, the journal is + written. The journal is also written immediately if the FLUSH + request is received. + +internal_hash:algorithm(:key) (the key is optional) + Use internal hash or crc. + When this argument is used, the dm-integrity target won't accept + integrity tags from the upper target, but it will automatically + generate and verify the integrity tags. + + You can use a crc algorithm (such as crc32), then integrity target + will protect the data against accidental corruption. + You can also use a hmac algorithm (for example + "hmac(sha256):0123456789abcdef"), in this mode it will provide + cryptographic authentication of the data without encryption. + + When this argument is not used, the integrity tags are accepted + from an upper layer target, such as dm-crypt. The upper layer + target should check the validity of the integrity tags. + +recalculate + Recalculate the integrity tags automatically. It is only valid + when using internal hash. + +journal_crypt:algorithm(:key) (the key is optional) + Encrypt the journal using given algorithm to make sure that the + attacker can't read the journal. You can use a block cipher here + (such as "cbc(aes)") or a stream cipher (for example "chacha20", + "salsa20", "ctr(aes)" or "ecb(arc4)"). + + The journal contains history of last writes to the block device, + an attacker reading the journal could see the last sector numbers + that were written. From the sector numbers, the attacker can infer + the size of files that were written.
To protect against this + situation, you can encrypt the journal. + +journal_mac:algorithm(:key) (the key is optional) + Protect sector numbers in the journal from accidental or malicious + modification. To protect against accidental modification, use a + crc algorithm, to protect against malicious modification, use a + hmac algorithm with a key. + + This option is not needed when using internal-hash because in this + mode, the integrity of journal entries is checked when replaying + the journal. Thus, modified sector number would be detected at + this stage. + +block_size:number + The size of a data block in bytes. The larger the block size the + less overhead there is for per-block integrity metadata. + Supported values are 512, 1024, 2048 and 4096 bytes. If not + specified the default block size is 512 bytes. + +sectors_per_bit:number + In the bitmap mode, this parameter specifies the number of + 512-byte sectors that corresponds to one bitmap bit. + +bitmap_flush_interval:number + The bitmap flush interval in milliseconds. The metadata buffers + are synchronized when this interval expires. + + +The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can +be changed when reloading the target (load an inactive table and swap the +tables with suspend and resume). The other arguments should not be changed +when reloading the target because the layout of disk data depend on them +and the reloaded target would be non-functional. + + +The layout of the formatted block device: + +* reserved sectors + (they are not used by this target, they can be used for + storing LUKS metadata or for other purpose), the size of the reserved + area is specified in the target arguments + +* superblock (4kiB) + * magic string - identifies that the device was formatted + * version + * log2(interleave sectors) + * integrity tag size + * the number of journal sections + * provided data sectors - the number of sectors that this target + provides (i.e. 
the size of the device minus the size of all + metadata and padding). The user of this target should not send + bios that access data beyond the "provided data sectors" limit. + * flags + SB_FLAG_HAVE_JOURNAL_MAC + - a flag is set if journal_mac is used + SB_FLAG_RECALCULATING + - recalculating is in progress + SB_FLAG_DIRTY_BITMAP + - journal area contains the bitmap of dirty + blocks + * log2(sectors per block) + * a position where recalculating finished +* journal + The journal is divided into sections, each section contains: + + * metadata area (4kiB), it contains journal entries + + - every journal entry contains: + + * logical sector (specifies where the data and tag should + be written) + * last 8 bytes of data + * integrity tag (the size is specified in the superblock) + + - every metadata sector ends with + + * mac (8-bytes), all the macs in 8 metadata sectors form a + 64-byte value. It is used to store hmac of sector + numbers in the journal section, to protect against a + possibility that the attacker tampers with sector + numbers in the journal. + * commit id + + * data area (the size is variable; it depends on how many journal + entries fit into the metadata area) + + - every sector in the data area contains: + + * data (504 bytes of data, the last 8 bytes are stored in + the journal entry) + * commit id + + To test if the whole journal section was written correctly, every + 512-byte sector of the journal ends with 8-byte commit id. If the + commit id matches on all sectors in a journal section, then it is + assumed that the section was written correctly. If the commit id + doesn't match, the section was written partially and it should not + be replayed. + +* one or more runs of interleaved tags and data. + Each run contains: + + * tag area - it contains integrity tags. There is one tag for each + sector in the data area + * data area - it contains data sectors. The number of data sectors + in one run must be a power of two. 
log2 of this value is stored + in the superblock. diff --git a/Documentation/device-mapper/dm-integrity.txt b/Documentation/device-mapper/dm-integrity.txt deleted file mode 100644 index d63d78ffeb73..000000000000 --- a/Documentation/device-mapper/dm-integrity.txt +++ /dev/null @@ -1,233 +0,0 @@ -The dm-integrity target emulates a block device that has additional -per-sector tags that can be used for storing integrity information. - -A general problem with storing integrity tags with every sector is that -writing the sector and the integrity tag must be atomic - i.e. in case of -crash, either both sector and integrity tag or none of them is written. - -To guarantee write atomicity, the dm-integrity target uses journal, it -writes sector data and integrity tags into a journal, commits the journal -and then copies the data and integrity tags to their respective location. - -The dm-integrity target can be used with the dm-crypt target - in this -situation the dm-crypt target creates the integrity data and passes them -to the dm-integrity target via bio_integrity_payload attached to the bio. -In this mode, the dm-crypt and dm-integrity targets provide authenticated -disk encryption - if the attacker modifies the encrypted device, an I/O -error is returned instead of random data. - -The dm-integrity target can also be used as a standalone target, in this -mode it calculates and verifies the integrity tag internally. In this -mode, the dm-integrity target can be used to detect silent data -corruption on the disk or in the I/O path. - -There's an alternate mode of operation where dm-integrity uses bitmap -instead of a journal. If a bit in the bitmap is 1, the corresponding -region's data and integrity tags are not synchronized - if the machine -crashes, the unsynchronized regions will be recalculated. 
The bitmap mode -is faster than the journal mode, because we don't have to write the data -twice, but it is also less reliable, because if data corruption happens -when the machine crashes, it may not be detected. - -When loading the target for the first time, the kernel driver will format -the device. But it will only format the device if the superblock contains -zeroes. If the superblock is neither valid nor zeroed, the dm-integrity -target can't be loaded. - -To use the target for the first time: -1. overwrite the superblock with zeroes -2. load the dm-integrity target with one-sector size, the kernel driver - will format the device -3. unload the dm-integrity target -4. read the "provided_data_sectors" value from the superblock -5. load the dm-integrity target with the the target size - "provided_data_sectors" -6. if you want to use dm-integrity with dm-crypt, load the dm-crypt target - with the size "provided_data_sectors" - - -Target arguments: - -1. the underlying block device - -2. the number of reserved sector at the beginning of the device - the - dm-integrity won't read of write these sectors - -3. the size of the integrity tag (if "-" is used, the size is taken from - the internal-hash algorithm) - -4. mode: - D - direct writes (without journal) - in this mode, journaling is - not used and data sectors and integrity tags are written - separately. In case of crash, it is possible that the data - and integrity tag doesn't match. - J - journaled writes - data and integrity tags are written to the - journal and atomicity is guaranteed. In case of crash, - either both data and tag or none of them are written. The - journaled mode degrades write throughput twice because the - data have to be written twice. - B - bitmap mode - data and metadata are written without any - synchronization, the driver maintains a bitmap of dirty - regions where data and metadata don't match. This mode can - only be used with internal hash. 
- R - recovery mode - in this mode, journal is not replayed, - checksums are not checked and writes to the device are not - allowed. This mode is useful for data recovery if the - device cannot be activated in any of the other standard - modes. - -5. the number of additional arguments - -Additional arguments: - -journal_sectors:number - The size of journal, this argument is used only if formatting the - device. If the device is already formatted, the value from the - superblock is used. - -interleave_sectors:number - The number of interleaved sectors. This values is rounded down to - a power of two. If the device is already formatted, the value from - the superblock is used. - -meta_device:device - Don't interleave the data and metadata on on device. Use a - separate device for metadata. - -buffer_sectors:number - The number of sectors in one buffer. The value is rounded down to - a power of two. - - The tag area is accessed using buffers, the buffer size is - configurable. The large buffer size means that the I/O size will - be larger, but there could be less I/Os issued. - -journal_watermark:number - The journal watermark in percents. When the size of the journal - exceeds this watermark, the thread that flushes the journal will - be started. - -commit_time:number - Commit time in milliseconds. When this time passes, the journal is - written. The journal is also written immediatelly if the FLUSH - request is received. - -internal_hash:algorithm(:key) (the key is optional) - Use internal hash or crc. - When this argument is used, the dm-integrity target won't accept - integrity tags from the upper target, but it will automatically - generate and verify the integrity tags. - - You can use a crc algorithm (such as crc32), then integrity target - will protect the data against accidental corruption. 
- You can also use a hmac algorithm (for example - "hmac(sha256):0123456789abcdef"), in this mode it will provide - cryptographic authentication of the data without encryption. - - When this argument is not used, the integrity tags are accepted - from an upper layer target, such as dm-crypt. The upper layer - target should check the validity of the integrity tags. - -recalculate - Recalculate the integrity tags automatically. It is only valid - when using internal hash. - -journal_crypt:algorithm(:key) (the key is optional) - Encrypt the journal using given algorithm to make sure that the - attacker can't read the journal. You can use a block cipher here - (such as "cbc(aes)") or a stream cipher (for example "chacha20", - "salsa20", "ctr(aes)" or "ecb(arc4)"). - - The journal contains history of last writes to the block device, - an attacker reading the journal could see the last sector nubmers - that were written. From the sector numbers, the attacker can infer - the size of files that were written. To protect against this - situation, you can encrypt the journal. - -journal_mac:algorithm(:key) (the key is optional) - Protect sector numbers in the journal from accidental or malicious - modification. To protect against accidental modification, use a - crc algorithm, to protect against malicious modification, use a - hmac algorithm with a key. - - This option is not needed when using internal-hash because in this - mode, the integrity of journal entries is checked when replaying - the journal. Thus, modified sector number would be detected at - this stage. - -block_size:number - The size of a data block in bytes. The larger the block size the - less overhead there is for per-block integrity metadata. - Supported values are 512, 1024, 2048 and 4096 bytes. If not - specified the default block size is 512 bytes. - -sectors_per_bit:number - In the bitmap mode, this parameter specifies the number of - 512-byte sectors that corresponds to one bitmap bit. 
- -bitmap_flush_interval:number - The bitmap flush interval in milliseconds. The metadata buffers - are synchronized when this interval expires. - - -The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can -be changed when reloading the target (load an inactive table and swap the -tables with suspend and resume). The other arguments should not be changed -when reloading the target because the layout of disk data depend on them -and the reloaded target would be non-functional. - - -The layout of the formatted block device: -* reserved sectors (they are not used by this target, they can be used for - storing LUKS metadata or for other purpose), the size of the reserved - area is specified in the target arguments -* superblock (4kiB) - * magic string - identifies that the device was formatted - * version - * log2(interleave sectors) - * integrity tag size - * the number of journal sections - * provided data sectors - the number of sectors that this target - provides (i.e. the size of the device minus the size of all - metadata and padding). The user of this target should not send - bios that access data beyond the "provided data sectors" limit. - * flags - SB_FLAG_HAVE_JOURNAL_MAC - a flag is set if journal_mac is used - SB_FLAG_RECALCULATING - recalculating is in progress - SB_FLAG_DIRTY_BITMAP - journal area contains the bitmap of dirty - blocks - * log2(sectors per block) - * a position where recalculating finished -* journal - The journal is divided into sections, each section contains: - * metadata area (4kiB), it contains journal entries - every journal entry contains: - * logical sector (specifies where the data and tag should - be written) - * last 8 bytes of data - * integrity tag (the size is specified in the superblock) - every metadata sector ends with - * mac (8-bytes), all the macs in 8 metadata sectors form a - 64-byte value. 
It is used to store hmac of sector - numbers in the journal section, to protect against a - possibility that the attacker tampers with sector - numbers in the journal. - * commit id - * data area (the size is variable; it depends on how many journal - entries fit into the metadata area) - every sector in the data area contains: - * data (504 bytes of data, the last 8 bytes are stored in - the journal entry) - * commit id - To test if the whole journal section was written correctly, every - 512-byte sector of the journal ends with 8-byte commit id. If the - commit id matches on all sectors in a journal section, then it is - assumed that the section was written correctly. If the commit id - doesn't match, the section was written partially and it should not - be replayed. -* one or more runs of interleaved tags and data. Each run contains: - * tag area - it contains integrity tags. There is one tag for each - sector in the data area - * data area - it contains data sectors. The number of data sectors - in one run must be a power of two. log2 of this value is stored - in the superblock. diff --git a/Documentation/device-mapper/dm-io.rst b/Documentation/device-mapper/dm-io.rst new file mode 100644 index 000000000000..d2492917a1f5 --- /dev/null +++ b/Documentation/device-mapper/dm-io.rst @@ -0,0 +1,75 @@ +===== +dm-io +===== + +Dm-io provides synchronous and asynchronous I/O services. There are three +types of I/O services available, and each type has a sync and an async +version. + +The user must set up an io_region structure to describe the desired location +of the I/O. Each io_region indicates a block-device along with the starting +sector and size of the region:: + + struct io_region { + struct block_device *bdev; + sector_t sector; + sector_t count; + }; + +Dm-io can read from one io_region or write to one or more io_regions. Writes +to multiple regions are specified by an array of io_region structures. 
+ +The first I/O service type takes a list of memory pages as the data buffer for +the I/O, along with an offset into the first page:: + + struct page_list { + struct page_list *next; + struct page *page; + }; + + int dm_io_sync(unsigned int num_regions, struct io_region *where, int rw, + struct page_list *pl, unsigned int offset, + unsigned long *error_bits); + int dm_io_async(unsigned int num_regions, struct io_region *where, int rw, + struct page_list *pl, unsigned int offset, + io_notify_fn fn, void *context); + +The second I/O service type takes an array of bio vectors as the data buffer +for the I/O. This service can be handy if the caller has a pre-assembled bio, +but wants to direct different portions of the bio to different devices:: + + int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where, + int rw, struct bio_vec *bvec, + unsigned long *error_bits); + int dm_io_async_bvec(unsigned int num_regions, struct io_region *where, + int rw, struct bio_vec *bvec, + io_notify_fn fn, void *context); + +The third I/O service type takes a pointer to a vmalloc'd memory buffer as the +data buffer for the I/O. This service can be handy if the caller needs to do +I/O to a large region but doesn't want to allocate a large number of individual +memory pages:: + + int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw, + void *data, unsigned long *error_bits); + int dm_io_async_vm(unsigned int num_regions, struct io_region *where, int rw, + void *data, io_notify_fn fn, void *context); + +Callers of the asynchronous I/O services must include the name of a completion +callback routine and a pointer to some context data for the I/O:: + + typedef void (*io_notify_fn)(unsigned long error, void *context); + +The "error" parameter in this callback, as well as the `*error` parameter in +all of the synchronous versions, is a bitset (instead of a simple error value). 
+In the case of a write-I/O to multiple regions, this bitset allows dm-io to +indicate success or failure on each individual region. + +Before using any of the dm-io services, the user should call dm_io_get() +and specify the number of pages they expect to perform I/O on concurrently. +Dm-io will attempt to resize its mempool to make sure enough pages are +always available in order to avoid unnecessary waiting while performing I/O. + +When the user is finished using the dm-io services, they should call +dm_io_put() and specify the same number of pages that were given on the +dm_io_get() call. diff --git a/Documentation/device-mapper/dm-io.txt b/Documentation/device-mapper/dm-io.txt deleted file mode 100644 index 3b5d9a52cdcf..000000000000 --- a/Documentation/device-mapper/dm-io.txt +++ /dev/null @@ -1,75 +0,0 @@ -dm-io -===== - -Dm-io provides synchronous and asynchronous I/O services. There are three -types of I/O services available, and each type has a sync and an async -version. - -The user must set up an io_region structure to describe the desired location -of the I/O. Each io_region indicates a block-device along with the starting -sector and size of the region. - - struct io_region { - struct block_device *bdev; - sector_t sector; - sector_t count; - }; - -Dm-io can read from one io_region or write to one or more io_regions. Writes -to multiple regions are specified by an array of io_region structures. - -The first I/O service type takes a list of memory pages as the data buffer for -the I/O, along with an offset into the first page.
- - struct page_list { - struct page_list *next; - struct page *page; - }; - - int dm_io_sync(unsigned int num_regions, struct io_region *where, int rw, - struct page_list *pl, unsigned int offset, - unsigned long *error_bits); - int dm_io_async(unsigned int num_regions, struct io_region *where, int rw, - struct page_list *pl, unsigned int offset, - io_notify_fn fn, void *context); - -The second I/O service type takes an array of bio vectors as the data buffer -for the I/O. This service can be handy if the caller has a pre-assembled bio, -but wants to direct different portions of the bio to different devices. - - int dm_io_sync_bvec(unsigned int num_regions, struct io_region *where, - int rw, struct bio_vec *bvec, - unsigned long *error_bits); - int dm_io_async_bvec(unsigned int num_regions, struct io_region *where, - int rw, struct bio_vec *bvec, - io_notify_fn fn, void *context); - -The third I/O service type takes a pointer to a vmalloc'd memory buffer as the -data buffer for the I/O. This service can be handy if the caller needs to do -I/O to a large region but doesn't want to allocate a large number of individual -memory pages. - - int dm_io_sync_vm(unsigned int num_regions, struct io_region *where, int rw, - void *data, unsigned long *error_bits); - int dm_io_async_vm(unsigned int num_regions, struct io_region *where, int rw, - void *data, io_notify_fn fn, void *context); - -Callers of the asynchronous I/O services must include the name of a completion -callback routine and a pointer to some context data for the I/O. - - typedef void (*io_notify_fn)(unsigned long error, void *context); - -The "error" parameter in this callback, as well as the "*error" parameter in -all of the synchronous versions, is a bitset (instead of a simple error value). -In the case of an write-I/O to multiple regions, this bitset allows dm-io to -indicate success or failure on each individual region. 
- -Before using any of the dm-io services, the user should call dm_io_get() -and specify the number of pages they expect to perform I/O on concurrently. -Dm-io will attempt to resize its mempool to make sure enough pages are -always available in order to avoid unnecessary waiting while performing I/O. - -When the user is finished using the dm-io services, they should call -dm_io_put() and specify the same number of pages that were given on the -dm_io_get() call. - diff --git a/Documentation/device-mapper/dm-log.rst b/Documentation/device-mapper/dm-log.rst new file mode 100644 index 000000000000..ba4fce39bc27 --- /dev/null +++ b/Documentation/device-mapper/dm-log.rst @@ -0,0 +1,57 @@ +===================== +Device-Mapper Logging +===================== +The device-mapper logging code is used by some of the device-mapper +RAID targets to track regions of the disk that are not consistent. +A region (or portion of the address space) of the disk may be +inconsistent because a RAID stripe is currently being operated on or +a machine died while the region was being altered. In the case of +mirrors, a region would be considered dirty/inconsistent while you +are writing to it because the writes need to be replicated for all +the legs of the mirror and may not reach the legs at the same time. +Once all writes are complete, the region is considered clean again. + +There is a generic logging interface that the device-mapper RAID +implementations use to perform logging operations (see +dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different +logging implementations are available and provide different +capabilities. 
The list includes: + +============== ============================================================== +Type Files +============== ============================================================== +disk drivers/md/dm-log.c +core drivers/md/dm-log.c +userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h +============== ============================================================== + +The "disk" log type +------------------- +This log implementation commits the log state to disk. This way, the +logging state survives reboots/crashes. + +The "core" log type +------------------- +This log implementation keeps the log state in memory. The log state +will not survive a reboot or crash, but there may be a small boost in +performance. This method can also be used if no storage device is +available for storing log state. + +The "userspace" log type +------------------------ +This log type simply provides a way to export the log API to userspace, +so log implementations can be done there. This is done by forwarding most +logging requests to userspace, where a daemon receives and processes the +request. + +The structure used for communication between kernel and userspace are +located in include/linux/dm-log-userspace.h. Due to the frequency, +diversity, and 2-way communication nature of the exchanges between +kernel and userspace, 'connector' is used as the interface for +communication. + +There are currently two userspace log implementations that leverage this +framework - "clustered-disk" and "clustered-core". These implementations +provide a cluster-coherent log for shared-storage. Device-mapper mirroring +can be used in a shared-storage environment when the cluster log implementations +are employed. 
diff --git a/Documentation/device-mapper/dm-log.txt b/Documentation/device-mapper/dm-log.txt deleted file mode 100644 index c155ac569c44..000000000000 --- a/Documentation/device-mapper/dm-log.txt +++ /dev/null @@ -1,54 +0,0 @@ -Device-Mapper Logging -===================== -The device-mapper logging code is used by some of the device-mapper -RAID targets to track regions of the disk that are not consistent. -A region (or portion of the address space) of the disk may be -inconsistent because a RAID stripe is currently being operated on or -a machine died while the region was being altered. In the case of -mirrors, a region would be considered dirty/inconsistent while you -are writing to it because the writes need to be replicated for all -the legs of the mirror and may not reach the legs at the same time. -Once all writes are complete, the region is considered clean again. - -There is a generic logging interface that the device-mapper RAID -implementations use to perform logging operations (see -dm_dirty_log_type in include/linux/dm-dirty-log.h). Various different -logging implementations are available and provide different -capabilities. The list includes: - -Type Files -==== ===== -disk drivers/md/dm-log.c -core drivers/md/dm-log.c -userspace drivers/md/dm-log-userspace* include/linux/dm-log-userspace.h - -The "disk" log type -------------------- -This log implementation commits the log state to disk. This way, the -logging state survives reboots/crashes. - -The "core" log type -------------------- -This log implementation keeps the log state in memory. The log state -will not survive a reboot or crash, but there may be a small boost in -performance. This method can also be used if no storage device is -available for storing log state. - -The "userspace" log type ------------------------- -This log type simply provides a way to export the log API to userspace, -so log implementations can be done there. 
This is done by forwarding most -logging requests to userspace, where a daemon receives and processes the -request. - -The structure used for communication between kernel and userspace are -located in include/linux/dm-log-userspace.h. Due to the frequency, -diversity, and 2-way communication nature of the exchanges between -kernel and userspace, 'connector' is used as the interface for -communication. - -There are currently two userspace log implementations that leverage this -framework - "clustered-disk" and "clustered-core". These implementations -provide a cluster-coherent log for shared-storage. Device-mapper mirroring -can be used in a shared-storage environment when the cluster log implementations -are employed. diff --git a/Documentation/device-mapper/dm-queue-length.rst b/Documentation/device-mapper/dm-queue-length.rst new file mode 100644 index 000000000000..d8e381c1cb02 --- /dev/null +++ b/Documentation/device-mapper/dm-queue-length.rst @@ -0,0 +1,48 @@ +=============== +dm-queue-length +=============== + +dm-queue-length is a path selector module for device-mapper targets, +which selects a path with the least number of in-flight I/Os. +The path selector name is 'queue-length'. + +Table parameters for each path: [] + +:: + + : The number of I/Os to dispatch using the selected + path before switching to the next path. + If not given, internal default is used. To check + the default value, see the activated table. + +Status for each path: + +:: + + : 'A' if the path is active, 'F' if the path is failed. + : The number of path failures. + : The number of in-flight I/Os on the path. + + +Algorithm +========= + +dm-queue-length increments/decrements 'in-flight' when an I/O is +dispatched/completed respectively. +dm-queue-length selects a path with the minimum 'in-flight'. + + +Examples +======== +In case that 2 paths (sda and sdb) are used with repeat_count == 128. 
+ +:: + + # echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \ + dmsetup create test + # + # dmsetup table + test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128 + # + # dmsetup status + test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0 diff --git a/Documentation/device-mapper/dm-queue-length.txt b/Documentation/device-mapper/dm-queue-length.txt deleted file mode 100644 index f4db2562175c..000000000000 --- a/Documentation/device-mapper/dm-queue-length.txt +++ /dev/null @@ -1,39 +0,0 @@ -dm-queue-length -=============== - -dm-queue-length is a path selector module for device-mapper targets, -which selects a path with the least number of in-flight I/Os. -The path selector name is 'queue-length'. - -Table parameters for each path: [] - : The number of I/Os to dispatch using the selected - path before switching to the next path. - If not given, internal default is used. To check - the default value, see the activated table. - -Status for each path: - : 'A' if the path is active, 'F' if the path is failed. - : The number of path failures. - : The number of in-flight I/Os on the path. - - -Algorithm -========= - -dm-queue-length increments/decrements 'in-flight' when an I/O is -dispatched/completed respectively. -dm-queue-length selects a path with the minimum 'in-flight'. - - -Examples -======== -In case that 2 paths (sda and sdb) are used with repeat_count == 128. 
- -# echo "0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128" \ - dmsetup create test -# -# dmsetup table -test: 0 10 multipath 0 0 1 1 queue-length 0 2 1 8:0 128 8:16 128 -# -# dmsetup status -test: 0 10 multipath 2 0 0 0 1 1 E 0 2 1 8:0 A 0 0 8:16 A 0 0 diff --git a/Documentation/device-mapper/dm-raid.rst b/Documentation/device-mapper/dm-raid.rst new file mode 100644 index 000000000000..2fe255b130fb --- /dev/null +++ b/Documentation/device-mapper/dm-raid.rst @@ -0,0 +1,419 @@ +======= +dm-raid +======= + +The device-mapper RAID (dm-raid) target provides a bridge from DM to MD. +It allows the MD RAID drivers to be accessed using a device-mapper +interface. + + +Mapping Table Interface +----------------------- +The target is named "raid" and it accepts the following parameters:: + + <#raid_params> \ + <#raid_devs> [.. ] + +: + + ============= =============================================================== + raid0 RAID0 striping (no resilience) + raid1 RAID1 mirroring + raid4 RAID4 with dedicated last parity disk + raid5_n RAID5 with dedicated last parity disk supporting takeover + Same as raid4 + + - Transitory layout + raid5_la RAID5 left asymmetric + + - rotating parity 0 with data continuation + raid5_ra RAID5 right asymmetric + + - rotating parity N with data continuation + raid5_ls RAID5 left symmetric + + - rotating parity 0 with data restart + raid5_rs RAID5 right symmetric + + - rotating parity N with data restart + raid6_zr RAID6 zero restart + + - rotating parity zero (left-to-right) with data restart + raid6_nr RAID6 N restart + + - rotating parity N (right-to-left) with data restart + raid6_nc RAID6 N continue + + - rotating parity N (right-to-left) with data continuation + raid6_n_6 RAID6 with dedicated parity disks + + - parity and Q-syndrome on the last 2 disks; + layout for takeover from/to raid4/raid5_n + raid6_la_6 Same as "raid5_la" plus dedicated last Q-syndrome disk + + - layout for takeover from raid5_la from/to raid6 + raid6_ra_6 Same
as "raid5_ra" dedicated last Q-syndrome disk + + - layout for takeover from raid5_ra from/to raid6 + raid6_ls_6 Same as "raid5_ls" dedicated last Q-syndrome disk + + - layout for takeover from raid5_ls from/to raid6 + raid6_rs_6 Same as "raid5_rs" dedicated last Q-syndrome disk + + - layout for takeover from raid5_rs from/to raid6 + raid10 Various RAID10 inspired algorithms chosen by additional params + (see raid10_format and raid10_copies below) + + - RAID10: Striped Mirrors (aka 'Striping on top of mirrors') + - RAID1E: Integrated Adjacent Stripe Mirroring + - RAID1E: Integrated Offset Stripe Mirroring + - and other similar RAID10 variants + ============= =============================================================== + + Reference: Chapter 4 of + http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf + +<#raid_params>: The number of parameters that follow. + +<raid_params> consists of + + Mandatory parameters: + <chunk_size>: + Chunk size in sectors. This parameter is often known as + "stripe size". It is the only mandatory parameter and + is placed first. + + followed by optional parameters (in any order): + [sync|nosync] + Force or prevent RAID initialization. + + [rebuild <idx>] + Rebuild drive number 'idx' (first drive is 0). + + [daemon_sleep <ms>] + Interval between runs of the bitmap daemon that + clear bits. A longer interval means less bitmap I/O but + resyncing after a failure is likely to take longer. + + [min_recovery_rate <kB/sec/disk>] + Throttle RAID initialization + [max_recovery_rate <kB/sec/disk>] + Throttle RAID initialization + [write_mostly <idx>] + Mark drive index 'idx' write-mostly. + [max_write_behind <sectors>] + See '--write-behind=' (man mdadm) + [stripe_cache <sectors>] + Stripe cache size (RAID 4/5/6 only) + [region_size <sectors>] + The region_size multiplied by the number of regions is the + logical size of the array. The bitmap records the device + synchronisation state for each region.
+ + [raid10_copies <# copies>], [raid10_format <near|far|offset>] + These two options are used to alter the default layout of + a RAID10 configuration. The number of copies is can be + specified, but the default is 2. There are also three + variations to how the copies are laid down - the default + is "near". Near copies are what most people think of with + respect to mirroring. If these options are left unspecified, + or 'raid10_copies 2' and/or 'raid10_format near' are given, + then the layouts for 2, 3 and 4 devices are: + + ======== ========== ============== + 2 drives 3 drives 4 drives + ======== ========== ============== + A1 A1 A1 A1 A2 A1 A1 A2 A2 + A2 A2 A2 A3 A3 A3 A3 A4 A4 + A3 A3 A4 A4 A5 A5 A5 A6 A6 + A4 A4 A5 A6 A6 A7 A7 A8 A8 + .. .. .. .. .. .. .. .. .. + ======== ========== ============== + + The 2-device layout is equivalent 2-way RAID1. The 4-device + layout is what a traditional RAID10 would look like. The + 3-device layout is what might be called a 'RAID1E - Integrated + Adjacent Stripe Mirroring'. + + If 'raid10_copies 2' and 'raid10_format far', then the layouts + for 2, 3 and 4 devices are: + + ======== ============ =================== + 2 drives 3 drives 4 drives + ======== ============ =================== + A1 A2 A1 A2 A3 A1 A2 A3 A4 + A3 A4 A4 A5 A6 A5 A6 A7 A8 + A5 A6 A7 A8 A9 A9 A10 A11 A12 + .. .. .. .. .. .. .. .. .. + A2 A1 A3 A1 A2 A2 A1 A4 A3 + A4 A3 A6 A4 A5 A6 A5 A8 A7 + A6 A5 A9 A7 A8 A10 A9 A12 A11 + .. .. .. .. .. .. .. .. .. + ======== ============ =================== + + If 'raid10_copies 2' and 'raid10_format offset', then the + layouts for 2, 3 and 4 devices are: + + ======== ========== ================ + 2 drives 3 drives 4 drives + ======== ========== ================ + A1 A2 A1 A2 A3 A1 A2 A3 A4 + A2 A1 A3 A1 A2 A2 A1 A4 A3 + A3 A4 A4 A5 A6 A5 A6 A7 A8 + A4 A3 A6 A4 A5 A6 A5 A8 A7 + A5 A6 A7 A8 A9 A9 A10 A11 A12 + A6 A5 A9 A7 A8 A10 A9 A12 A11 + .. .. .. .. .. .. .. .. ..
+ ======== ========== ================ + + Here we see layouts closely akin to 'RAID1E - Integrated + Offset Stripe Mirroring'. + + [delta_disks <N>] + The delta_disks option value (-251 < N < +251) triggers + device removal (negative value) or device addition (positive + value) to any reshape supporting raid levels 4/5/6 and 10. + RAID levels 4/5/6 allow for addition of devices (metadata + and data device tuple), raid10_near and raid10_offset only + allow for device addition. raid10_far does not support any + reshaping at all. + A minimum of devices have to be kept to enforce resilience, + which is 3 devices for raid4/5 and 4 devices for raid6. + + [data_offset <sectors>] + This option value defines the offset into each data device + where the data starts. This is used to provide out-of-place + reshaping space to avoid writing over data while + changing the layout of stripes, hence an interruption/crash + may happen at any time without the risk of losing data. + E.g. when adding devices to an existing raid set during + forward reshaping, the out-of-place space will be allocated + at the beginning of each raid device. The kernel raid4/5/6/10 + MD personalities supporting such device addition will read the data from + the existing first stripes (those with smaller number of stripes) + starting at data_offset to fill up a new stripe with the larger + number of stripes, calculate the redundancy blocks (CRC/Q-syndrome) + and write that new stripe to offset 0. Same will be applied to all + N-1 other new stripes. This out-of-place scheme is used to change + the RAID type (i.e. the allocation algorithm) as well, e.g. + changing from raid5_ls to raid5_n. + + [journal_dev <device>] + This option adds a journal device to raid4/5/6 raid sets and + uses it to close the 'write hole' caused by the non-atomic updates + to the component devices which can cause data loss during recovery.
+ The journal device is used as writethrough thus causing writes to + be throttled versus non-journaled raid4/5/6 sets. + Takeover/reshape is not possible with a raid4/5/6 journal device; + it has to be deconfigured before requesting these. + + [journal_mode <mode>] + This option sets the caching mode on journaled raid4/5/6 raid sets + (see 'journal_dev <device>' above) to 'writethrough' or 'writeback'. + If 'writeback' is selected the journal device has to be resilient + and must not suffer from the 'write hole' problem itself (e.g. use + raid1 or raid10) to avoid a single point of failure. + +<#raid_devs>: The number of devices composing the array. + Each device consists of two entries. The first is the device + containing the metadata (if any); the second is the one containing the + data. A Maximum of 64 metadata/data device entries are supported + up to target version 1.8.0. + 1.9.0 supports up to 253 which is enforced by the used MD kernel runtime. + + If a drive has failed or is missing at creation time, a '-' can be + given for both the metadata and data drives for a given position. + + +Example Tables +-------------- + +:: + + # RAID4 - 4 data drives, 1 parity (no metadata devices) + # No metadata devices specified to hold superblock/bitmap info + # Chunk size of 1MiB + # (Lines separated for easy reading) + + 0 1960893648 raid \ + raid4 1 2048 \ + 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81 + + # RAID4 - 4 data drives, 1 parity (with metadata devices) + # Chunk size of 1MiB, force RAID initialization, + # min recovery rate at 20 kiB/sec/disk + + 0 1960893648 raid \ + raid4 4 2048 sync min_recovery_rate 20 \ + 5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82 + + +Status Output +------------- +'dmsetup table' displays the table used to construct the mapping. +The optional parameters are always printed in the order listed +above with "sync" or "nosync" always output ahead of the other +arguments, regardless of the order used when originally loading the table.
+Arguments that can be repeated are ordered by value. + + +'dmsetup status' yields information on the state and health of the array. +The output is as follows (normally a single line, but expanded here for +clarity):: + + 1: <s> <l> raid \ + 2: <raid_type> <#devices> <health_chars> \ + 3: <sync_ratio> <sync_action> <mismatch_cnt> + +Line 1 is the standard output produced by device-mapper. + +Line 2 & 3 are produced by the raid target and are best explained by example:: + + 0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0 + +Here we can see the RAID type is raid4, there are 5 devices - all of +which are 'A'live, and the array is 2/490221568 complete with its initial +recovery. Here is a fuller description of the individual fields: + + =============== ========================================================= + <raid_type> Same as the <raid_type> used to create the array. + <health_chars> One char for each device, indicating: + + - 'A' = alive and in-sync + - 'a' = alive but not in-sync + - 'D' = dead/failed. + <sync_ratio> The ratio indicating how much of the array has undergone + the process described by 'sync_action'. If the + 'sync_action' is "check" or "repair", then the process + of "resync" or "recover" can be considered complete. + <sync_action> One of the following possible states: + + idle + - No synchronization action is being performed. + frozen + - The current action has been halted. + resync + - Array is undergoing its initial synchronization + or is resynchronizing after an unclean shutdown + (possibly aided by a bitmap). + recover + - A device in the array is being rebuilt or + replaced. + check + - A user-initiated full check of the array is + being performed. All blocks are read and + checked for consistency. The number of + discrepancies found are recorded in + <mismatch_cnt>. No changes are made to the + array by this action. + repair + - The same as "check", but discrepancies are + corrected. + reshape + - The array is undergoing a reshape. + <mismatch_cnt> The number of discrepancies found between mirror copies + in RAID1/10 or wrong parity values found in RAID4/5/6.
+ This value is valid only after a "check" of the array + is performed. A healthy array has a 'mismatch_cnt' of 0. + <data_offset> The current data offset to the start of the user data on + each component device of a raid set (see the respective + raid parameter to support out-of-place reshaping). + <journal_char> - 'A' - active write-through journal device. + - 'a' - active write-back journal device. + - 'D' - dead journal device. + - '-' - no journal device. + =============== ========================================================= + + +Message Interface +----------------- +The dm-raid target will accept certain actions through the 'message' interface. +('man dmsetup' for more information on the message interface.) These actions +include: + + ========= ================================================ + "idle" Halt the current sync action. + "frozen" Freeze the current sync action. + "resync" Initiate/continue a resync. + "recover" Initiate/continue a recover process. + "check" Initiate a check (i.e. a "scrub") of the array. + "repair" Initiate a repair of the array. + ========= ================================================ + + +Discard Support +--------------- +The implementation of discard support among hardware vendors varies. +When a block is discarded, some storage devices will return zeroes when +the block is read. These devices set the 'discard_zeroes_data' +attribute. Other devices will return random data. Confusingly, some +devices that advertise 'discard_zeroes_data' will not reliably return +zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks +from a number of devices to calculate parity blocks and (for performance +reasons) relies on 'discard_zeroes_data' being reliable, it is important +that the devices be consistent. Blocks may be discarded in the middle +of a RAID 4/5/6 stripe and if subsequent read results are not +consistent, the parity blocks may be calculated differently at any time; +making the parity blocks useless for redundancy.
It is important to +understand how your hardware behaves with discards if you are going to +enable discards with RAID 4/5/6. + +Since the behavior of storage devices is unreliable in this respect, +even when reporting 'discard_zeroes_data', by default RAID 4/5/6 +discard support is disabled -- this ensures data integrity at the +expense of losing some performance. + +Storage devices that properly support 'discard_zeroes_data' are +increasingly whitelisted in the kernel and can thus be trusted. + +For trusted devices, the following dm-raid module parameter can be set +to safely enable discard support for RAID 4/5/6: + + 'devices_handle_discards_safely' + + +Version History +--------------- + +:: + + 1.0.0 Initial version. Support for RAID 4/5/6 + 1.1.0 Added support for RAID 1 + 1.2.0 Handle creation of arrays that contain failed devices. + 1.3.0 Added support for RAID 10 + 1.3.1 Allow device replacement/rebuild for RAID 10 + 1.3.2 Fix/improve redundancy checking for RAID10 + 1.4.0 Non-functional change. Removes arg from mapping function. + 1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5). + 1.4.2 Add RAID10 "far" and "offset" algorithm support. + 1.5.0 Add message interface to allow manipulation of the sync_action. + New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt. + 1.5.1 Add ability to restore transiently failed devices on resume. + 1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check". + 1.6.0 Add discard support (and devices_handle_discard_safely module param). + 1.7.0 Add support for MD RAID0 mappings. + 1.8.0 Explicitly check for compatible flags in the superblock metadata + and reject to start the raid set if any are set by a newer + target version, thus avoiding data corruption on a raid set + with a reshape in progress. + 1.9.0 Add support for RAID level takeover/reshape/region size + and set size reduction. 
+ 1.9.1 Fix activation of existing RAID 4/10 mapped devices + 1.9.2 Don't emit '- -' on the status table line in case the constructor + fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and + 'D' on the status line. If '- -' is passed into the constructor, emit + '- -' on the table line and '-' as the status line health character. + 1.10.0 Add support for raid4/5/6 journal device + 1.10.1 Fix data corruption on reshape request + 1.11.0 Fix table line argument order + (wrong raid10_copies/raid10_format sequence) + 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option + 1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available + 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A') + 1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an + state races. + 1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen + 1.14.0 Fix reshape race on small devices. Fix stripe adding reshape + deadlock/potential data corruption. Update superblock when + specific devices are requested via rebuild. Fix RAID leg + rebuild errors. diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt deleted file mode 100644 index 2355bef14653..000000000000 --- a/Documentation/device-mapper/dm-raid.txt +++ /dev/null @@ -1,354 +0,0 @@ -dm-raid -======= - -The device-mapper RAID (dm-raid) target provides a bridge from DM to MD. -It allows the MD RAID drivers to be accessed using a device-mapper -interface. - - -Mapping Table Interface ------------------------ -The target is named "raid" and it accepts the following parameters: - - <raid_type> <#raid_params> <raid_params> \ - <#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>
] - -<raid_type>: - raid0 RAID0 striping (no resilience) - raid1 RAID1 mirroring - raid4 RAID4 with dedicated last parity disk - raid5_n RAID5 with dedicated last parity disk supporting takeover - Same as raid4 - -Transitory layout - raid5_la RAID5 left asymmetric - - rotating parity 0 with data continuation - raid5_ra RAID5 right asymmetric - - rotating parity N with data continuation - raid5_ls RAID5 left symmetric - - rotating parity 0 with data restart - raid5_rs RAID5 right symmetric - - rotating parity N with data restart - raid6_zr RAID6 zero restart - - rotating parity zero (left-to-right) with data restart - raid6_nr RAID6 N restart - - rotating parity N (right-to-left) with data restart - raid6_nc RAID6 N continue - - rotating parity N (right-to-left) with data continuation - raid6_n_6 RAID6 with dedicate parity disks - - parity and Q-syndrome on the last 2 disks; - layout for takeover from/to raid4/raid5_n - raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk - - layout for takeover from raid5_la from/to raid6 - raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk - - layout for takeover from raid5_ra from/to raid6 - raid6_ls_6 Same as "raid5_ls" dedicated last Q-syndrome disk - - layout for takeover from raid5_ls from/to raid6 - raid6_rs_6 Same as "raid5_rs" dedicated last Q-syndrome disk - - layout for takeover from raid5_rs from/to raid6 - raid10 Various RAID10 inspired algorithms chosen by additional params - (see raid10_format and raid10_copies below) - - RAID10: Striped Mirrors (aka 'Striping on top of mirrors') - - RAID1E: Integrated Adjacent Stripe Mirroring - - RAID1E: Integrated Offset Stripe Mirroring - - and other similar RAID10 variants - - Reference: Chapter 4 of - http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf - -<#raid_params>: The number of parameters that follow. - -<raid_params> consists of - Mandatory parameters: - <chunk_size>: Chunk size in sectors. This parameter is often known as - "stripe size".
It is the only mandatory parameter and - is placed first. - - followed by optional parameters (in any order): - [sync|nosync] Force or prevent RAID initialization. - - [rebuild <idx>] Rebuild drive number 'idx' (first drive is 0). - - [daemon_sleep <ms>] - Interval between runs of the bitmap daemon that - clear bits. A longer interval means less bitmap I/O but - resyncing after a failure is likely to take longer. - - [min_recovery_rate <kB/sec/disk>] Throttle RAID initialization - [max_recovery_rate <kB/sec/disk>] Throttle RAID initialization - [write_mostly <idx>] Mark drive index 'idx' write-mostly. - [max_write_behind <sectors>] See '--write-behind=' (man mdadm) - [stripe_cache <sectors>] Stripe cache size (RAID 4/5/6 only) - [region_size <sectors>] - The region_size multiplied by the number of regions is the - logical size of the array. The bitmap records the device - synchronisation state for each region. - - [raid10_copies <# copies>] - [raid10_format <near|far|offset>] - These two options are used to alter the default layout of - a RAID10 configuration. The number of copies is can be - specified, but the default is 2. There are also three - variations to how the copies are laid down - the default - is "near". Near copies are what most people think of with - respect to mirroring. If these options are left unspecified, - or 'raid10_copies 2' and/or 'raid10_format near' are given, - then the layouts for 2, 3 and 4 devices are: - 2 drives 3 drives 4 drives - -------- ---------- -------------- - A1 A1 A1 A1 A2 A1 A1 A2 A2 - A2 A2 A2 A3 A3 A3 A3 A4 A4 - A3 A3 A4 A4 A5 A5 A5 A6 A6 - A4 A4 A5 A6 A6 A7 A7 A8 A8 - .. .. .. .. .. .. .. .. .. - The 2-device layout is equivalent 2-way RAID1. The 4-device - layout is what a traditional RAID10 would look like. The - 3-device layout is what might be called a 'RAID1E - Integrated - Adjacent Stripe Mirroring'.
- - If 'raid10_copies 2' and 'raid10_format far', then the layouts - for 2, 3 and 4 devices are: - 2 drives 3 drives 4 drives - -------- -------------- -------------------- - A1 A2 A1 A2 A3 A1 A2 A3 A4 - A3 A4 A4 A5 A6 A5 A6 A7 A8 - A5 A6 A7 A8 A9 A9 A10 A11 A12 - .. .. .. .. .. .. .. .. .. - A2 A1 A3 A1 A2 A2 A1 A4 A3 - A4 A3 A6 A4 A5 A6 A5 A8 A7 - A6 A5 A9 A7 A8 A10 A9 A12 A11 - .. .. .. .. .. .. .. .. .. - - If 'raid10_copies 2' and 'raid10_format offset', then the - layouts for 2, 3 and 4 devices are: - 2 drives 3 drives 4 drives - -------- ------------ ----------------- - A1 A2 A1 A2 A3 A1 A2 A3 A4 - A2 A1 A3 A1 A2 A2 A1 A4 A3 - A3 A4 A4 A5 A6 A5 A6 A7 A8 - A4 A3 A6 A4 A5 A6 A5 A8 A7 - A5 A6 A7 A8 A9 A9 A10 A11 A12 - A6 A5 A9 A7 A8 A10 A9 A12 A11 - .. .. .. .. .. .. .. .. .. - Here we see layouts closely akin to 'RAID1E - Integrated - Offset Stripe Mirroring'. - - [delta_disks <N>] - The delta_disks option value (-251 < N < +251) triggers - device removal (negative value) or device addition (positive - value) to any reshape supporting raid levels 4/5/6 and 10. - RAID levels 4/5/6 allow for addition of devices (metadata - and data device tuple), raid10_near and raid10_offset only - allow for device addition. raid10_far does not support any - reshaping at all. - A minimum of devices have to be kept to enforce resilience, - which is 3 devices for raid4/5 and 4 devices for raid6. - - [data_offset <sectors>] - This option value defines the offset into each data device - where the data starts. This is used to provide out-of-place - reshaping space to avoid writing over data while - changing the layout of stripes, hence an interruption/crash - may happen at any time without the risk of losing data. - E.g. when adding devices to an existing raid set during - forward reshaping, the out-of-place space will be allocated - at the beginning of each raid device.
The kernel raid4/5/6/10 - MD personalities supporting such device addition will read the data from - the existing first stripes (those with smaller number of stripes) - starting at data_offset to fill up a new stripe with the larger - number of stripes, calculate the redundancy blocks (CRC/Q-syndrome) - and write that new stripe to offset 0. Same will be applied to all - N-1 other new stripes. This out-of-place scheme is used to change - the RAID type (i.e. the allocation algorithm) as well, e.g. - changing from raid5_ls to raid5_n. - - [journal_dev <device>] - This option adds a journal device to raid4/5/6 raid sets and - uses it to close the 'write hole' caused by the non-atomic updates - to the component devices which can cause data loss during recovery. - The journal device is used as writethrough thus causing writes to - be throttled versus non-journaled raid4/5/6 sets. - Takeover/reshape is not possible with a raid4/5/6 journal device; - it has to be deconfigured before requesting these. - - [journal_mode <mode>] - This option sets the caching mode on journaled raid4/5/6 raid sets - (see 'journal_dev <device>' above) to 'writethrough' or 'writeback'. - If 'writeback' is selected the journal device has to be resilient - and must not suffer from the 'write hole' problem itself (e.g. use - raid1 or raid10) to avoid a single point of failure. - -<#raid_devs>: The number of devices composing the array. - Each device consists of two entries. The first is the device - containing the metadata (if any); the second is the one containing the - data. A Maximum of 64 metadata/data device entries are supported - up to target version 1.8.0. - 1.9.0 supports up to 253 which is enforced by the used MD kernel runtime. - - If a drive has failed or is missing at creation time, a '-' can be - given for both the metadata and data drives for a given position.
- - -Example Tables --------------- -# RAID4 - 4 data drives, 1 parity (no metadata devices) -# No metadata devices specified to hold superblock/bitmap info -# Chunk size of 1MiB -# (Lines separated for easy reading) - -0 1960893648 raid \ - raid4 1 2048 \ - 5 - 8:17 - 8:33 - 8:49 - 8:65 - 8:81 - -# RAID4 - 4 data drives, 1 parity (with metadata devices) -# Chunk size of 1MiB, force RAID initialization, -# min recovery rate at 20 kiB/sec/disk - -0 1960893648 raid \ - raid4 4 2048 sync min_recovery_rate 20 \ - 5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82 - - -Status Output -------------- -'dmsetup table' displays the table used to construct the mapping. -The optional parameters are always printed in the order listed -above with "sync" or "nosync" always output ahead of the other -arguments, regardless of the order used when originally loading the table. -Arguments that can be repeated are ordered by value. - - -'dmsetup status' yields information on the state and health of the array. -The output is as follows (normally a single line, but expanded here for -clarity): -1: <s> <l> raid \ -2: <raid_type> <#devices> <health_chars> \ -3: <sync_ratio> <sync_action> <mismatch_cnt> - -Line 1 is the standard output produced by device-mapper. -Line 2 & 3 are produced by the raid target and are best explained by example: - 0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0 -Here we can see the RAID type is raid4, there are 5 devices - all of -which are 'A'live, and the array is 2/490221568 complete with its initial -recovery. Here is a fuller description of the individual fields: - <raid_type> Same as the <raid_type> used to create the array. - <health_chars> One char for each device, indicating: 'A' = alive and - in-sync, 'a' = alive but not in-sync, 'D' = dead/failed. - <sync_ratio> The ratio indicating how much of the array has undergone - the process described by 'sync_action'. If the - 'sync_action' is "check" or "repair", then the process - of "resync" or "recover" can be considered complete. - <sync_action> One of the following possible states: - idle - No synchronization action is being performed.
- frozen - The current action has been halted. - resync - Array is undergoing its initial synchronization - or is resynchronizing after an unclean shutdown - (possibly aided by a bitmap). - recover - A device in the array is being rebuilt or - replaced. - check - A user-initiated full check of the array is - being performed. All blocks are read and - checked for consistency. The number of - discrepancies found are recorded in - <mismatch_cnt>. No changes are made to the - array by this action. - repair - The same as "check", but discrepancies are - corrected. - reshape - The array is undergoing a reshape. - <mismatch_cnt> The number of discrepancies found between mirror copies - in RAID1/10 or wrong parity values found in RAID4/5/6. - This value is valid only after a "check" of the array - is performed. A healthy array has a 'mismatch_cnt' of 0. - <data_offset> The current data offset to the start of the user data on - each component device of a raid set (see the respective - raid parameter to support out-of-place reshaping). - <journal_char> 'A' - active write-through journal device. - 'a' - active write-back journal device. - 'D' - dead journal device. - '-' - no journal device. - - -Message Interface ------------------ -The dm-raid target will accept certain actions through the 'message' interface. -('man dmsetup' for more information on the message interface.) These actions -include: - "idle" - Halt the current sync action. - "frozen" - Freeze the current sync action. - "resync" - Initiate/continue a resync. - "recover"- Initiate/continue a recover process. - "check" - Initiate a check (i.e. a "scrub") of the array. - "repair" - Initiate a repair of the array. - - -Discard Support ---------------- -The implementation of discard support among hardware vendors varies. -When a block is discarded, some storage devices will return zeroes when -the block is read. These devices set the 'discard_zeroes_data' -attribute. Other devices will return random data.
Confusingly, some -devices that advertise 'discard_zeroes_data' will not reliably return -zeroes when discarded blocks are read! Since RAID 4/5/6 uses blocks -from a number of devices to calculate parity blocks and (for performance -reasons) relies on 'discard_zeroes_data' being reliable, it is important -that the devices be consistent. Blocks may be discarded in the middle -of a RAID 4/5/6 stripe and if subsequent read results are not -consistent, the parity blocks may be calculated differently at any time; -making the parity blocks useless for redundancy. It is important to -understand how your hardware behaves with discards if you are going to -enable discards with RAID 4/5/6. - -Since the behavior of storage devices is unreliable in this respect, -even when reporting 'discard_zeroes_data', by default RAID 4/5/6 -discard support is disabled -- this ensures data integrity at the -expense of losing some performance. - -Storage devices that properly support 'discard_zeroes_data' are -increasingly whitelisted in the kernel and can thus be trusted. - -For trusted devices, the following dm-raid module parameter can be set -to safely enable discard support for RAID 4/5/6: - 'devices_handle_discards_safely' - - -Version History ---------------- -1.0.0 Initial version. Support for RAID 4/5/6 -1.1.0 Added support for RAID 1 -1.2.0 Handle creation of arrays that contain failed devices. -1.3.0 Added support for RAID 10 -1.3.1 Allow device replacement/rebuild for RAID 10 -1.3.2 Fix/improve redundancy checking for RAID10 -1.4.0 Non-functional change. Removes arg from mapping function. -1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5). -1.4.2 Add RAID10 "far" and "offset" algorithm support. -1.5.0 Add message interface to allow manipulation of the sync_action. - New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt. -1.5.1 Add ability to restore transiently failed devices on resume. -1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check". 
-1.6.0 Add discard support (and devices_handle_discard_safely module param). -1.7.0 Add support for MD RAID0 mappings. -1.8.0 Explicitly check for compatible flags in the superblock metadata - and reject to start the raid set if any are set by a newer - target version, thus avoiding data corruption on a raid set - with a reshape in progress. -1.9.0 Add support for RAID level takeover/reshape/region size - and set size reduction. -1.9.1 Fix activation of existing RAID 4/10 mapped devices -1.9.2 Don't emit '- -' on the status table line in case the constructor - fails reading a superblock. Correctly emit 'maj:min1 maj:min2' and - 'D' on the status line. If '- -' is passed into the constructor, emit - '- -' on the table line and '-' as the status line health character. -1.10.0 Add support for raid4/5/6 journal device -1.10.1 Fix data corruption on reshape request -1.11.0 Fix table line argument order - (wrong raid10_copies/raid10_format sequence) -1.11.1 Add raid4/5/6 journal write-back support via journal_mode option -1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available -1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A') -1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an - state races. -1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen -1.14.0 Fix reshape race on small devices. Fix stripe adding reshape - deadlock/potential data corruption. Update superblock when - specific devices are requested via rebuild. Fix RAID leg - rebuild errors. 
diff --git a/Documentation/device-mapper/dm-service-time.rst b/Documentation/device-mapper/dm-service-time.rst new file mode 100644 index 000000000000..facf277fc13c --- /dev/null +++ b/Documentation/device-mapper/dm-service-time.rst @@ -0,0 +1,101 @@ +=============== +dm-service-time +=============== + +dm-service-time is a path selector module for device-mapper targets, +which selects a path with the shortest estimated service time for +the incoming I/O. + +The service time for each path is estimated by dividing the total size +of in-flight I/Os on a path with the performance value of the path. +The performance value is a relative throughput value among all paths +in a path-group, and it can be specified as a table argument. + +The path selector name is 'service-time'. + +Table parameters for each path: + + [<repeat_count> [<relative_throughput>]] + <repeat_count>: + The number of I/Os to dispatch using the selected + path before switching to the next path. + If not given, internal default is used. To check + the default value, see the activated table. + <relative_throughput>: + The relative throughput value of the path + among all paths in the path-group. + The valid range is 0-100. + If not given, minimum value '1' is used. + If '0' is given, the path isn't selected while + other paths having a positive value are available. + +Status for each path: + + <status> <fail-count> <in-flight-size> <relative_throughput> + <status>: + 'A' if the path is active, 'F' if the path is failed. + <fail-count>: + The number of path failures. + <in-flight-size>: + The size of in-flight I/Os on the path. + <relative_throughput>: + The relative throughput value of the path + among all paths in the path-group. + + +Algorithm +========= + +dm-service-time adds the I/O size to 'in-flight-size' when the I/O is +dispatched and subtracts when completed. +Basically, dm-service-time selects a path having minimum service time +which is calculated by:: + + ('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput' + +However, some optimizations below are used to reduce the calculation +as much as possible. + + 1.
If the paths have the same 'relative_throughput', skip + the division and just compare the 'in-flight-size'. + + 2. If the paths have the same 'in-flight-size', skip the division + and just compare the 'relative_throughput'. + + 3. If some paths have non-zero 'relative_throughput' and others + have zero 'relative_throughput', ignore those paths with zero + 'relative_throughput'. + +If such optimizations can't be applied, calculate service time, and +compare service time. +If calculated service time is equal, the path having maximum +'relative_throughput' may be better. So compare 'relative_throughput' +then. + + +Examples +======== +In case that 2 paths (sda and sdb) are used with repeat_count == 128 +and sda has an average throughput 1GB/s and sdb has 4GB/s, +'relative_throughput' value may be '1' for sda and '4' for sdb:: + + # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \ + dmsetup create test + # + # dmsetup table + test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4 + # + # dmsetup status + test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4 + + +Or '2' for sda and '8' for sdb would be also true:: + + # echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \ + dmsetup create test + # + # dmsetup table + test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8 + # + # dmsetup status + test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8 diff --git a/Documentation/device-mapper/dm-service-time.txt b/Documentation/device-mapper/dm-service-time.txt deleted file mode 100644 index fb1d4a0cf122..000000000000 --- a/Documentation/device-mapper/dm-service-time.txt +++ /dev/null @@ -1,91 +0,0 @@ -dm-service-time -=============== - -dm-service-time is a path selector module for device-mapper targets, -which selects a path with the shortest estimated service time for -the incoming I/O. 
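The selection rule from the Algorithm section of dm-service-time.rst above can be modeled in a few lines of userspace Python. This is only an illustrative sketch and not the kernel implementation: the `Path` class, the `better` and `select_path` helpers, and the use of cross-multiplication in place of the division are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Path:
    """Hypothetical userspace model of one path in a path-group."""
    name: str
    in_flight_size: int       # size of I/O currently in flight on the path
    relative_throughput: int  # table argument, documented range 0-100


def better(a: Path, b: Path, incoming: int) -> Path:
    """Return whichever of a and b the documented rules prefer."""
    # Optimization 1: same throughput, so compare in-flight sizes only.
    if a.relative_throughput == b.relative_throughput:
        return a if a.in_flight_size <= b.in_flight_size else b
    # Optimization 3: ignore a zero-throughput path when the other is positive.
    if a.relative_throughput == 0:
        return b
    if b.relative_throughput == 0:
        return a
    # Optimization 2: same in-flight size, so compare throughput only.
    if a.in_flight_size == b.in_flight_size:
        return a if a.relative_throughput > b.relative_throughput else b
    # General case: compare (in_flight + incoming) / throughput.
    # Cross-multiplying avoids the division (an assumption of this sketch).
    cost_a = (a.in_flight_size + incoming) * b.relative_throughput
    cost_b = (b.in_flight_size + incoming) * a.relative_throughput
    if cost_a != cost_b:
        return a if cost_a < cost_b else b
    # Equal service time: prefer the path with the larger throughput.
    return a if a.relative_throughput > b.relative_throughput else b


def select_path(paths: List[Path], incoming: int) -> Optional[Path]:
    """Pick the preferred path for an incoming I/O of the given size."""
    best: Optional[Path] = None
    for p in paths:
        best = p if best is None else better(best, p, incoming)
    return best
```

With the sda/sdb example from the Examples section (throughput 1 and 4, both idle), `select_path` returns sdb; once sdb has enough I/O in flight that its estimated service time exceeds that of sda, sda is chosen instead.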
- -The service time for each path is estimated by dividing the total size -of in-flight I/Os on a path with the performance value of the path. -The performance value is a relative throughput value among all paths -in a path-group, and it can be specified as a table argument. - -The path selector name is 'service-time'. - -Table parameters for each path: [<repeat_count> [<relative_throughput>]] - <repeat_count>: The number of I/Os to dispatch using the selected - path before switching to the next path. - If not given, internal default is used. To check - the default value, see the activated table. - <relative_throughput>: The relative throughput value of the path - among all paths in the path-group. - The valid range is 0-100. - If not given, minimum value '1' is used. - If '0' is given, the path isn't selected while - other paths having a positive value are available. - -Status for each path: <status> <fail-count> <in-flight-size> \ - <relative_throughput> - <status>: 'A' if the path is active, 'F' if the path is failed. - <fail-count>: The number of path failures. - <in-flight-size>: The size of in-flight I/Os on the path. - <relative_throughput>: The relative throughput value of the path - among all paths in the path-group. - - -Algorithm -========= - -dm-service-time adds the I/O size to 'in-flight-size' when the I/O is -dispatched and subtracts when completed. -Basically, dm-service-time selects a path having minimum service time -which is calculated by: - - ('in-flight-size' + 'size-of-incoming-io') / 'relative_throughput' - -However, some optimizations below are used to reduce the calculation -as much as possible. - - 1. If the paths have the same 'relative_throughput', skip - the division and just compare the 'in-flight-size'. - - 2. If the paths have the same 'in-flight-size', skip the division - and just compare the 'relative_throughput'. - - 3. If some paths have non-zero 'relative_throughput' and others - have zero 'relative_throughput', ignore those paths with zero - 'relative_throughput'. - -If such optimizations can't be applied, calculate service time, and -compare service time.
-If calculated service time is equal, the path having maximum -'relative_throughput' may be better. So compare 'relative_throughput' -then. - - -Examples -======== -In case that 2 paths (sda and sdb) are used with repeat_count == 128 -and sda has an average throughput 1GB/s and sdb has 4GB/s, -'relative_throughput' value may be '1' for sda and '4' for sdb. - -# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \ - dmsetup create test -# -# dmsetup table -test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4 -# -# dmsetup status -test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4 - - -Or '2' for sda and '8' for sdb would be also true. - -# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \ - dmsetup create test -# -# dmsetup table -test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8 -# -# dmsetup status -test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8 diff --git a/Documentation/device-mapper/dm-uevent.rst b/Documentation/device-mapper/dm-uevent.rst new file mode 100644 index 000000000000..4a8ee8d069c9 --- /dev/null +++ b/Documentation/device-mapper/dm-uevent.rst @@ -0,0 +1,110 @@ +==================== +device-mapper uevent +==================== + +The device-mapper uevent code adds the capability to device-mapper to create +and send kobject uevents (uevents). Previously device-mapper events were only +available through the ioctl interface. The advantage of the uevents interface +is the event contains environment attributes providing increased context for +the event avoiding the need to query the state of the device-mapper device after +the event is received. + +There are two functions currently for device-mapper events. 
The first function +listed creates the event and the second function sends the event(s):: + + void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti, + const char *path, unsigned nr_valid_paths) + + void dm_send_uevents(struct list_head *events, struct kobject *kobj) + + +The variables added to the uevent environment are: + +Variable Name: DM_TARGET +------------------------ +:Uevent Action(s): KOBJ_CHANGE +:Type: string +:Description: +:Value: Name of device-mapper target that generated the event. + +Variable Name: DM_ACTION +------------------------ +:Uevent Action(s): KOBJ_CHANGE +:Type: string +:Description: +:Value: Device-mapper specific action that caused the uevent action. + PATH_FAILED - A path has failed; + PATH_REINSTATED - A path has been reinstated. + +Variable Name: DM_SEQNUM +------------------------ +:Uevent Action(s): KOBJ_CHANGE +:Type: unsigned integer +:Description: A sequence number for this specific device-mapper device. +:Value: Valid unsigned integer range. + +Variable Name: DM_PATH +---------------------- +:Uevent Action(s): KOBJ_CHANGE +:Type: string +:Description: Major and minor number of the path device pertaining to this + event. +:Value: Path name in the form of "Major:Minor" + +Variable Name: DM_NR_VALID_PATHS +-------------------------------- +:Uevent Action(s): KOBJ_CHANGE +:Type: unsigned integer +:Description: +:Value: Valid unsigned integer range. + +Variable Name: DM_NAME +---------------------- +:Uevent Action(s): KOBJ_CHANGE +:Type: string +:Description: Name of the device-mapper device. +:Value: Name + +Variable Name: DM_UUID +---------------------- +:Uevent Action(s): KOBJ_CHANGE +:Type: string +:Description: UUID of the device-mapper device. +:Value: UUID. (Empty string if there isn't one.) + +An example of the uevents generated as captured by udevmonitor is shown +below + +1.) 
Path failure:: + + UEVENT[1192521009.711215] change@/block/dm-3 + ACTION=change + DEVPATH=/block/dm-3 + SUBSYSTEM=block + DM_TARGET=multipath + DM_ACTION=PATH_FAILED + DM_SEQNUM=1 + DM_PATH=8:32 + DM_NR_VALID_PATHS=0 + DM_NAME=mpath2 + DM_UUID=mpath-35333333000002328 + MINOR=3 + MAJOR=253 + SEQNUM=1130 + +2.) Path reinstate:: + + UEVENT[1192521132.989927] change@/block/dm-3 + ACTION=change + DEVPATH=/block/dm-3 + SUBSYSTEM=block + DM_TARGET=multipath + DM_ACTION=PATH_REINSTATED + DM_SEQNUM=2 + DM_PATH=8:32 + DM_NR_VALID_PATHS=1 + DM_NAME=mpath2 + DM_UUID=mpath-35333333000002328 + MINOR=3 + MAJOR=253 + SEQNUM=1131 diff --git a/Documentation/device-mapper/dm-uevent.txt b/Documentation/device-mapper/dm-uevent.txt deleted file mode 100644 index 07edbd85c714..000000000000 --- a/Documentation/device-mapper/dm-uevent.txt +++ /dev/null @@ -1,97 +0,0 @@ -The device-mapper uevent code adds the capability to device-mapper to create -and send kobject uevents (uevents). Previously device-mapper events were only -available through the ioctl interface. The advantage of the uevents interface -is the event contains environment attributes providing increased context for -the event avoiding the need to query the state of the device-mapper device after -the event is received. - -There are two functions currently for device-mapper events. The first function -listed creates the event and the second function sends the event(s). - -void dm_path_uevent(enum dm_uevent_type event_type, struct dm_target *ti, - const char *path, unsigned nr_valid_paths) - -void dm_send_uevents(struct list_head *events, struct kobject *kobj) - - -The variables added to the uevent environment are: - -Variable Name: DM_TARGET -Uevent Action(s): KOBJ_CHANGE -Type: string -Description: -Value: Name of device-mapper target that generated the event. - -Variable Name: DM_ACTION -Uevent Action(s): KOBJ_CHANGE -Type: string -Description: -Value: Device-mapper specific action that caused the uevent action. 
- PATH_FAILED - A path has failed. - PATH_REINSTATED - A path has been reinstated. - -Variable Name: DM_SEQNUM -Uevent Action(s): KOBJ_CHANGE -Type: unsigned integer -Description: A sequence number for this specific device-mapper device. -Value: Valid unsigned integer range. - -Variable Name: DM_PATH -Uevent Action(s): KOBJ_CHANGE -Type: string -Description: Major and minor number of the path device pertaining to this -event. -Value: Path name in the form of "Major:Minor" - -Variable Name: DM_NR_VALID_PATHS -Uevent Action(s): KOBJ_CHANGE -Type: unsigned integer -Description: -Value: Valid unsigned integer range. - -Variable Name: DM_NAME -Uevent Action(s): KOBJ_CHANGE -Type: string -Description: Name of the device-mapper device. -Value: Name - -Variable Name: DM_UUID -Uevent Action(s): KOBJ_CHANGE -Type: string -Description: UUID of the device-mapper device. -Value: UUID. (Empty string if there isn't one.) - -An example of the uevents generated as captured by udevmonitor is shown -below. - -1.) Path failure. -UEVENT[1192521009.711215] change@/block/dm-3 -ACTION=change -DEVPATH=/block/dm-3 -SUBSYSTEM=block -DM_TARGET=multipath -DM_ACTION=PATH_FAILED -DM_SEQNUM=1 -DM_PATH=8:32 -DM_NR_VALID_PATHS=0 -DM_NAME=mpath2 -DM_UUID=mpath-35333333000002328 -MINOR=3 -MAJOR=253 -SEQNUM=1130 - -2.) Path reinstate. 
-UEVENT[1192521132.989927] change@/block/dm-3 -ACTION=change -DEVPATH=/block/dm-3 -SUBSYSTEM=block -DM_TARGET=multipath -DM_ACTION=PATH_REINSTATED -DM_SEQNUM=2 -DM_PATH=8:32 -DM_NR_VALID_PATHS=1 -DM_NAME=mpath2 -DM_UUID=mpath-35333333000002328 -MINOR=3 -MAJOR=253 -SEQNUM=1131 diff --git a/Documentation/device-mapper/dm-zoned.rst b/Documentation/device-mapper/dm-zoned.rst new file mode 100644 index 000000000000..07f56ebc1730 --- /dev/null +++ b/Documentation/device-mapper/dm-zoned.rst @@ -0,0 +1,146 @@ +======== +dm-zoned +======== + +The dm-zoned device mapper target exposes a zoned block device (ZBC and +ZAC compliant devices) as a regular block device without any write +pattern constraints. In effect, it implements a drive-managed zoned +block device which hides from the user (a file system or an application +doing raw block device accesses) the sequential write constraints of +host-managed zoned block devices and can mitigate the potential +device-side performance degradation due to excessive random writes on +host-aware zoned block devices. + +For a more detailed description of the zoned block device models and +their constraints see (for SCSI devices): + +http://www.t10.org/drafts.htm#ZBC_Family + +and (for ATA devices): + +http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf + +The dm-zoned implementation is simple and minimizes system overhead (CPU +and memory usage as well as storage capacity loss). For a 10TB +host-managed disk with 256 MB zones, dm-zoned memory usage per disk +instance is at most 4.5 MB and as little as 5 zones will be used +internally for storing metadata and performaing reclaim operations. + +dm-zoned target devices are formatted and checked using the dmzadm +utility available at: + +https://github.com/hgst/dm-zoned-tools + +Algorithm +========= + +dm-zoned implements an on-disk buffering scheme to handle non-sequential +write accesses to the sequential zones of a zoned block device. 
+Conventional zones are used for caching as well as for storing internal +metadata. + +The zones of the device are separated into 2 types: + +1) Metadata zones: these are conventional zones used to store metadata. +Metadata zones are not reported as useable capacity to the user. + +2) Data zones: all remaining zones, the vast majority of which will be +sequential zones used exclusively to store user data. The conventional +zones of the device may be used also for buffering user random writes. +Data in these zones may be directly mapped to the conventional zone, but +later moved to a sequential zone so that the conventional zone can be +reused for buffering incoming random writes. + +dm-zoned exposes a logical device with a sector size of 4096 bytes, +irrespective of the physical sector size of the backend zoned block +device being used. This allows reducing the amount of metadata needed to +manage valid blocks (blocks written). + +The on-disk metadata format is as follows: + +1) The first block of the first conventional zone found contains the +super block which describes the on disk amount and position of metadata +blocks. + +2) Following the super block, a set of blocks is used to describe the +mapping of the logical device blocks. The mapping is done per chunk of +blocks, with the chunk size equal to the zoned block device size. The +mapping table is indexed by chunk number and each mapping entry +indicates the zone number of the device storing the chunk of data. Each +mapping entry may also indicate if the zone number of a conventional +zone used to buffer random modification to the data zone. + +3) A set of blocks used to store bitmaps indicating the validity of +blocks in the data zones follows the mapping table. A valid block is +defined as a block that was written and not discarded. For a buffered +data chunk, a block is always valid only in the data zone mapping the +chunk or in the buffer zone of the chunk. 
+ +For a logical chunk mapped to a conventional zone, all write operations +are processed by directly writing to the zone. If the mapping zone is a +sequential zone, the write operation is processed directly only if the +write offset within the logical chunk is equal to the write pointer +offset within of the sequential data zone (i.e. the write operation is +aligned on the zone write pointer). Otherwise, write operations are +processed indirectly using a buffer zone. In that case, an unused +conventional zone is allocated and assigned to the chunk being +accessed. Writing a block to the buffer zone of a chunk will +automatically invalidate the same block in the sequential zone mapping +the chunk. If all blocks of the sequential zone become invalid, the zone +is freed and the chunk buffer zone becomes the primary zone mapping the +chunk, resulting in native random write performance similar to a regular +block device. + +Read operations are processed according to the block validity +information provided by the bitmaps. Valid blocks are read either from +the sequential zone mapping a chunk, or if the chunk is buffered, from +the buffer zone assigned. If the accessed chunk has no mapping, or the +accessed blocks are invalid, the read buffer is zeroed and the read +operation terminated. + +After some time, the limited number of convnetional zones available may +be exhausted (all used to map chunks or buffer sequential zones) and +unaligned writes to unbuffered chunks become impossible. To avoid this +situation, a reclaim process regularly scans used conventional zones and +tries to reclaim the least recently used zones by copying the valid +blocks of the buffer zone to a free sequential zone. Once the copy +completes, the chunk mapping is updated to point to the sequential zone +and the buffer zone freed for reuse. 
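The write handling described above reduces to one small decision per write. The following Python sketch models only that decision; `ZoneType`, `Zone`, `route_write` and the returned labels are hypothetical names chosen for illustration and are not taken from the dm-zoned source.

```python
from dataclasses import dataclass
from enum import Enum


class ZoneType(Enum):
    CONVENTIONAL = "conventional"
    SEQUENTIAL = "sequential"


@dataclass
class Zone:
    """Hypothetical model of the data zone that maps a logical chunk."""
    zone_type: ZoneType
    write_pointer: int = 0  # block offset of the next sequential write


def route_write(zone: Zone, chunk_offset: int) -> str:
    """Classify a write at block offset chunk_offset within the chunk."""
    # Conventional zones accept writes at any offset.
    if zone.zone_type is ZoneType.CONVENTIONAL:
        return "direct"
    # A sequential zone accepts the write only at its write pointer.
    if chunk_offset == zone.write_pointer:
        return "direct"
    # Anything else goes through a conventional buffer zone.
    return "buffered"
```

In this model, a write at offset 10 to a sequential zone whose write pointer is at 10 is classified as direct, while a write at offset 3 to the same zone is classified as buffered.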
+ +Metadata Protection +=================== + +To protect metadata against corruption in case of sudden power loss or +system crash, 2 sets of metadata zones are used. One set, the primary +set, is used as the main metadata region, while the secondary set is +used as a staging area. Modified metadata is first written to the +secondary set and validated by updating the super block in the secondary +set, a generation counter is used to indicate that this set contains the +newest metadata. Once this operation completes, in place of metadata +block updates can be done in the primary metadata set. This ensures that +one of the set is always consistent (all modifications committed or none +at all). Flush operations are used as a commit point. Upon reception of +a flush request, metadata modification activity is temporarily blocked +(for both incoming BIO processing and reclaim process) and all dirty +metadata blocks are staged and updated. Normal operation is then +resumed. Flushing metadata thus only temporarily delays write and +discard requests. Read requests can be processed concurrently while +metadata flush is being executed. + +Usage +===== + +A zoned block device must first be formatted using the dmzadm tool. This +will analyze the device zone configuration, determine where to place the +metadata sets on the device and initialize the metadata sets. + +Ex:: + + dmzadm --format /dev/sdxx + +For a formatted device, the target can be created normally with the +dmsetup utility. The only parameter that dm-zoned requires is the +underlying zoned block device name. 
Ex:: + + echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \ + dmsetup create dmz-`basename ${dev}` diff --git a/Documentation/device-mapper/dm-zoned.txt b/Documentation/device-mapper/dm-zoned.txt deleted file mode 100644 index 736fcc78d193..000000000000 --- a/Documentation/device-mapper/dm-zoned.txt +++ /dev/null @@ -1,144 +0,0 @@ -dm-zoned -======== - -The dm-zoned device mapper target exposes a zoned block device (ZBC and -ZAC compliant devices) as a regular block device without any write -pattern constraints. In effect, it implements a drive-managed zoned -block device which hides from the user (a file system or an application -doing raw block device accesses) the sequential write constraints of -host-managed zoned block devices and can mitigate the potential -device-side performance degradation due to excessive random writes on -host-aware zoned block devices. - -For a more detailed description of the zoned block device models and -their constraints see (for SCSI devices): - -http://www.t10.org/drafts.htm#ZBC_Family - -and (for ATA devices): - -http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf - -The dm-zoned implementation is simple and minimizes system overhead (CPU -and memory usage as well as storage capacity loss). For a 10TB -host-managed disk with 256 MB zones, dm-zoned memory usage per disk -instance is at most 4.5 MB and as little as 5 zones will be used -internally for storing metadata and performaing reclaim operations. - -dm-zoned target devices are formatted and checked using the dmzadm -utility available at: - -https://github.com/hgst/dm-zoned-tools - -Algorithm -========= - -dm-zoned implements an on-disk buffering scheme to handle non-sequential -write accesses to the sequential zones of a zoned block device. -Conventional zones are used for caching as well as for storing internal -metadata. 
- -The zones of the device are separated into 2 types: - -1) Metadata zones: these are conventional zones used to store metadata. -Metadata zones are not reported as useable capacity to the user. - -2) Data zones: all remaining zones, the vast majority of which will be -sequential zones used exclusively to store user data. The conventional -zones of the device may be used also for buffering user random writes. -Data in these zones may be directly mapped to the conventional zone, but -later moved to a sequential zone so that the conventional zone can be -reused for buffering incoming random writes. - -dm-zoned exposes a logical device with a sector size of 4096 bytes, -irrespective of the physical sector size of the backend zoned block -device being used. This allows reducing the amount of metadata needed to -manage valid blocks (blocks written). - -The on-disk metadata format is as follows: - -1) The first block of the first conventional zone found contains the -super block which describes the on disk amount and position of metadata -blocks. - -2) Following the super block, a set of blocks is used to describe the -mapping of the logical device blocks. The mapping is done per chunk of -blocks, with the chunk size equal to the zoned block device size. The -mapping table is indexed by chunk number and each mapping entry -indicates the zone number of the device storing the chunk of data. Each -mapping entry may also indicate if the zone number of a conventional -zone used to buffer random modification to the data zone. - -3) A set of blocks used to store bitmaps indicating the validity of -blocks in the data zones follows the mapping table. A valid block is -defined as a block that was written and not discarded. For a buffered -data chunk, a block is always valid only in the data zone mapping the -chunk or in the buffer zone of the chunk. - -For a logical chunk mapped to a conventional zone, all write operations -are processed by directly writing to the zone. 
If the mapping zone is a -sequential zone, the write operation is processed directly only if the -write offset within the logical chunk is equal to the write pointer -offset within of the sequential data zone (i.e. the write operation is -aligned on the zone write pointer). Otherwise, write operations are -processed indirectly using a buffer zone. In that case, an unused -conventional zone is allocated and assigned to the chunk being -accessed. Writing a block to the buffer zone of a chunk will -automatically invalidate the same block in the sequential zone mapping -the chunk. If all blocks of the sequential zone become invalid, the zone -is freed and the chunk buffer zone becomes the primary zone mapping the -chunk, resulting in native random write performance similar to a regular -block device. - -Read operations are processed according to the block validity -information provided by the bitmaps. Valid blocks are read either from -the sequential zone mapping a chunk, or if the chunk is buffered, from -the buffer zone assigned. If the accessed chunk has no mapping, or the -accessed blocks are invalid, the read buffer is zeroed and the read -operation terminated. - -After some time, the limited number of convnetional zones available may -be exhausted (all used to map chunks or buffer sequential zones) and -unaligned writes to unbuffered chunks become impossible. To avoid this -situation, a reclaim process regularly scans used conventional zones and -tries to reclaim the least recently used zones by copying the valid -blocks of the buffer zone to a free sequential zone. Once the copy -completes, the chunk mapping is updated to point to the sequential zone -and the buffer zone freed for reuse. - -Metadata Protection -=================== - -To protect metadata against corruption in case of sudden power loss or -system crash, 2 sets of metadata zones are used. 
One set, the primary -set, is used as the main metadata region, while the secondary set is -used as a staging area. Modified metadata is first written to the -secondary set and validated by updating the super block in the secondary -set, a generation counter is used to indicate that this set contains the -newest metadata. Once this operation completes, in place of metadata -block updates can be done in the primary metadata set. This ensures that -one of the set is always consistent (all modifications committed or none -at all). Flush operations are used as a commit point. Upon reception of -a flush request, metadata modification activity is temporarily blocked -(for both incoming BIO processing and reclaim process) and all dirty -metadata blocks are staged and updated. Normal operation is then -resumed. Flushing metadata thus only temporarily delays write and -discard requests. Read requests can be processed concurrently while -metadata flush is being executed. - -Usage -===== - -A zoned block device must first be formatted using the dmzadm tool. This -will analyze the device zone configuration, determine where to place the -metadata sets on the device and initialize the metadata sets. - -Ex: - -dmzadm --format /dev/sdxx - -For a formatted device, the target can be created normally with the -dmsetup utility. The only parameter that dm-zoned requires is the -underlying zoned block device name. Ex: - -echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}` diff --git a/Documentation/device-mapper/era.rst b/Documentation/device-mapper/era.rst new file mode 100644 index 000000000000..90dd5c670b9f --- /dev/null +++ b/Documentation/device-mapper/era.rst @@ -0,0 +1,116 @@ +====== +dm-era +====== + +Introduction +============ + +dm-era is a target that behaves similar to the linear target. In +addition it keeps track of which blocks were written within a user +defined period of time called an 'era'. 
Each era target instance +maintains the current era as a monotonically increasing 32-bit +counter. + +Use cases include tracking changed blocks for backup software, and +partially invalidating the contents of a cache to restore cache +coherency after rolling back a vendor snapshot. + +Constructor +=========== + +era <metadata dev> <origin dev> <block size> + + ================ ====================================================== + metadata dev fast device holding the persistent metadata + origin dev device holding data blocks that may change + block size block size of origin data device, granularity that is + tracked by the target + ================ ====================================================== + +Messages +======== + +None of the dm messages take any arguments. + +checkpoint +---------- + +Possibly move to a new era. You shouldn't assume the era has +incremented. After sending this message, you should check the +current era via the status line. + +take_metadata_snap +------------------ + +Create a clone of the metadata, to allow a userland process to read it. + +drop_metadata_snap +------------------ + +Drop the metadata snapshot. + +Status +====== + +<metadata block size> <#used metadata blocks>/<#total metadata blocks> +<current era> <held metadata root | '-'> + +========================= ============================================== +metadata block size Fixed block size for each metadata block in + sectors +#used metadata blocks Number of metadata blocks used +#total metadata blocks Total number of metadata blocks +current era The current era +held metadata root The location, in blocks, of the metadata root + that has been 'held' for userspace read + access.
'-' indicates there is no held root +========================= ============================================== + +Detailed use case +================= + +The scenario of invalidating a cache when rolling back a vendor +snapshot was the primary use case when developing this target: + +Taking a vendor snapshot +------------------------ + +- Send a checkpoint message to the era target +- Make a note of the current era in its status line +- Take vendor snapshot (the era and snapshot should be forever + associated now). + +Rolling back to an vendor snapshot +---------------------------------- + +- Cache enters passthrough mode (see: dm-cache's docs in cache.txt) +- Rollback vendor storage +- Take metadata snapshot +- Ascertain which blocks have been written since the snapshot was taken + by checking each block's era +- Invalidate those blocks in the caching software +- Cache returns to writeback/writethrough mode + +Memory usage +============ + +The target uses a bitset to record writes in the current era. It also +has a spare bitset ready for switching over to a new era. Other than +that it uses a few 4k blocks for updating metadata:: + + (4 * nr_blocks) bytes + buffers + +Resilience +========== + +Metadata is updated on disk before a write to a previously unwritten +block is performed. As such dm-era should not be effected by a hard +crash such as power failure. + +Userland tools +============== + +Userland tools are found in the increasingly poorly named +thin-provisioning-tools project: + + https://github.com/jthornber/thin-provisioning-tools diff --git a/Documentation/device-mapper/era.txt b/Documentation/device-mapper/era.txt deleted file mode 100644 index 3c6d01be3560..000000000000 --- a/Documentation/device-mapper/era.txt +++ /dev/null @@ -1,108 +0,0 @@ -Introduction -============ - -dm-era is a target that behaves similar to the linear target. In -addition it keeps track of which blocks were written within a user -defined period of time called an 'era'. 
Each era target instance -maintains the current era as a monotonically increasing 32-bit -counter. - -Use cases include tracking changed blocks for backup software, and -partially invalidating the contents of a cache to restore cache -coherency after rolling back a vendor snapshot. - -Constructor -=========== - - era <metadata dev> <origin dev> <block size> - - metadata dev : fast device holding the persistent metadata - origin dev : device holding data blocks that may change - block size : block size of origin data device, granularity that is - tracked by the target - -Messages -======== - -None of the dm messages take any arguments. - -checkpoint ----------- - -Possibly move to a new era. You shouldn't assume the era has -incremented. After sending this message, you should check the -current era via the status line. - -take_metadata_snap ------------------- - -Create a clone of the metadata, to allow a userland process to read it. - -drop_metadata_snap ------------------- - -Drop the metadata snapshot. - -Status -====== - -<metadata block size> <#used metadata blocks>/<#total metadata blocks> -<current era> <held metadata root | '-'> - -metadata block size : Fixed block size for each metadata block in - sectors -#used metadata blocks : Number of metadata blocks used -#total metadata blocks : Total number of metadata blocks -current era : The current era -held metadata root : The location, in blocks, of the metadata root - that has been 'held' for userspace read - access. '-' indicates there is no held root - -Detailed use case -================= - -The scenario of invalidating a cache when rolling back a vendor -snapshot was the primary use case when developing this target: - -Taking a vendor snapshot ------------------------- - -- Send a checkpoint message to the era target -- Make a note of the current era in its status line -- Take vendor snapshot (the era and snapshot should be forever - associated now).
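The era bookkeeping behind this use case (a per-target era counter, the checkpoint message, and a per-block era lookup) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not how dm-era stores its metadata: `EraTracker` and its methods are hypothetical names, and the model always advances the era on checkpoint, which the real target explicitly does not guarantee.

```python
from typing import List, Optional


class EraTracker:
    """Hypothetical in-memory model of per-block era tracking."""

    def __init__(self, nr_blocks: int) -> None:
        self.current_era = 0
        # Era in which each block was last written; None means never written.
        self.block_era: List[Optional[int]] = [None] * nr_blocks

    def write(self, block: int) -> None:
        """Record a write to a block in the current era."""
        self.block_era[block] = self.current_era

    def checkpoint(self) -> int:
        """Move to a new era (simplification: always increments)."""
        self.current_era += 1
        return self.current_era

    def written_since(self, era: int) -> List[int]:
        """Blocks whose last recorded write happened in era or later."""
        return [b for b, e in enumerate(self.block_era)
                if e is not None and e >= era]


tracker = EraTracker(nr_blocks=8)
tracker.write(1)
tracker.write(2)
snapshot_era = tracker.checkpoint()  # note the era, then take the snapshot
tracker.write(2)
tracker.write(5)
blocks_to_invalidate = tracker.written_since(snapshot_era)
```

Here `blocks_to_invalidate` holds the blocks written after the snapshot was taken, which are the ones a cache would have to invalidate after rolling back.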
- -Rolling back to an vendor snapshot ----------------------------------- - -- Cache enters passthrough mode (see: dm-cache's docs in cache.txt) -- Rollback vendor storage -- Take metadata snapshot -- Ascertain which blocks have been written since the snapshot was taken - by checking each block's era -- Invalidate those blocks in the caching software -- Cache returns to writeback/writethrough mode - -Memory usage -============ - -The target uses a bitset to record writes in the current era. It also -has a spare bitset ready for switching over to a new era. Other than -that it uses a few 4k blocks for updating metadata. - - (4 * nr_blocks) bytes + buffers - -Resilience -========== - -Metadata is updated on disk before a write to a previously unwritten -block is performed. As such dm-era should not be effected by a hard -crash such as power failure. - -Userland tools -============== - -Userland tools are found in the increasingly poorly named -thin-provisioning-tools project: - - https://github.com/jthornber/thin-provisioning-tools diff --git a/Documentation/device-mapper/index.rst b/Documentation/device-mapper/index.rst new file mode 100644 index 000000000000..105e253bc231 --- /dev/null +++ b/Documentation/device-mapper/index.rst @@ -0,0 +1,44 @@ +:orphan: + +============= +Device Mapper +============= + +.. toctree:: + :maxdepth: 1 + + cache-policies + cache + delay + dm-crypt + dm-flakey + dm-init + dm-integrity + dm-io + dm-log + dm-queue-length + dm-raid + dm-service-time + dm-uevent + dm-zoned + era + kcopyd + linear + log-writes + persistent-data + snapshot + statistics + striped + switch + thin-provisioning + unstriped + verity + writecache + zero + +.. 
only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/device-mapper/kcopyd.rst b/Documentation/device-mapper/kcopyd.rst new file mode 100644 index 000000000000..7651d395127f --- /dev/null +++ b/Documentation/device-mapper/kcopyd.rst @@ -0,0 +1,47 @@ +====== +kcopyd +====== + +Kcopyd provides the ability to copy a range of sectors from one block-device +to one or more other block-devices, with an asynchronous completion +notification. It is used by dm-snapshot and dm-mirror. + +Users of kcopyd must first create a client and indicate how many memory pages +to set aside for their copy jobs. This is done with a call to +kcopyd_client_create():: + + int kcopyd_client_create(unsigned int num_pages, + struct kcopyd_client **result); + +To start a copy job, the user must set up io_region structures to describe +the source and destinations of the copy. Each io_region indicates a +block-device along with the starting sector and size of the region. The source +of the copy is given as one io_region structure, and the destinations of the +copy are given as an array of io_region structures:: + + struct io_region { + struct block_device *bdev; + sector_t sector; + sector_t count; + }; + +To start the copy, the user calls kcopyd_copy(), passing in the client +pointer, pointers to the source and destination io_regions, the name of a +completion callback routine, and a pointer to some context data for the copy:: + + int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from, + unsigned int num_dests, struct io_region *dests, + unsigned int flags, kcopyd_notify_fn fn, void *context); + + typedef void (*kcopyd_notify_fn)(int read_err, unsigned int write_err, + void *context); + +When the copy completes, kcopyd will call the user's completion routine, +passing back the user's context pointer. It will also indicate if a read or +write error occurred during the copy. 
+ +When a user is done with all their copy jobs, they should call +kcopyd_client_destroy() to delete the kcopyd client, which will release the +associated memory pages:: + + void kcopyd_client_destroy(struct kcopyd_client *kc); diff --git a/Documentation/device-mapper/kcopyd.txt b/Documentation/device-mapper/kcopyd.txt deleted file mode 100644 index 820382c4cecf..000000000000 --- a/Documentation/device-mapper/kcopyd.txt +++ /dev/null @@ -1,47 +0,0 @@ -kcopyd -====== - -Kcopyd provides the ability to copy a range of sectors from one block-device -to one or more other block-devices, with an asynchronous completion -notification. It is used by dm-snapshot and dm-mirror. - -Users of kcopyd must first create a client and indicate how many memory pages -to set aside for their copy jobs. This is done with a call to -kcopyd_client_create(). - - int kcopyd_client_create(unsigned int num_pages, - struct kcopyd_client **result); - -To start a copy job, the user must set up io_region structures to describe -the source and destinations of the copy. Each io_region indicates a -block-device along with the starting sector and size of the region. The source -of the copy is given as one io_region structure, and the destinations of the -copy are given as an array of io_region structures. - - struct io_region { - struct block_device *bdev; - sector_t sector; - sector_t count; - }; - -To start the copy, the user calls kcopyd_copy(), passing in the client -pointer, pointers to the source and destination io_regions, the name of a -completion callback routine, and a pointer to some context data for the copy. 
- - int kcopyd_copy(struct kcopyd_client *kc, struct io_region *from, - unsigned int num_dests, struct io_region *dests, - unsigned int flags, kcopyd_notify_fn fn, void *context); - - typedef void (*kcopyd_notify_fn)(int read_err, unsigned int write_err, - void *context); - -When the copy completes, kcopyd will call the user's completion routine, -passing back the user's context pointer. It will also indicate if a read or -write error occurred during the copy. - -When a user is done with all their copy jobs, they should call -kcopyd_client_destroy() to delete the kcopyd client, which will release the -associated memory pages. - - void kcopyd_client_destroy(struct kcopyd_client *kc); - diff --git a/Documentation/device-mapper/linear.rst b/Documentation/device-mapper/linear.rst new file mode 100644 index 000000000000..9d17fc6e64a9 --- /dev/null +++ b/Documentation/device-mapper/linear.rst @@ -0,0 +1,63 @@ +========= +dm-linear +========= + +Device-Mapper's "linear" target maps a linear range of the Device-Mapper +device onto a linear range of another device. This is the basic building +block of logical volume managers. + +Parameters: + : + Full pathname to the underlying block-device, or a + "major:minor" device-number. + : + Starting sector within the device. + + +Example scripts +=============== + +:: + + #!/bin/sh + # Create an identity mapping for a device + echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity + +:: + + #!/bin/sh + # Join 2 devices together + size1=`blockdev --getsz $1` + size2=`blockdev --getsz $2` + echo "0 $size1 linear $1 0 + $size1 $size2 linear $2 0" | dmsetup create joined + +:: + + #!/usr/bin/perl -w + # Split a device into 4M chunks and then join them together in reverse order. 
+ + my $name = "reverse"; + my $extent_size = 4 * 1024 * 2; + my $dev = $ARGV[0]; + my $table = ""; + my $count = 0; + + if (!defined($dev)) { + die("Please specify a device.\n"); + } + + my $dev_size = `blockdev --getsz $dev`; + my $extents = int($dev_size / $extent_size) - + (($dev_size % $extent_size) ? 1 : 0); + + while ($extents > 0) { + my $this_start = $count * $extent_size; + $extents--; + $count++; + my $this_offset = $extents * $extent_size; + + $table .= "$this_start $extent_size linear $dev $this_offset\n"; + } + + `echo \"$table\" | dmsetup create $name`; diff --git a/Documentation/device-mapper/linear.txt b/Documentation/device-mapper/linear.txt deleted file mode 100644 index 7cb98d89d3f8..000000000000 --- a/Documentation/device-mapper/linear.txt +++ /dev/null @@ -1,61 +0,0 @@ -dm-linear -========= - -Device-Mapper's "linear" target maps a linear range of the Device-Mapper -device onto a linear range of another device. This is the basic building -block of logical volume managers. - -Parameters: - : Full pathname to the underlying block-device, or a - "major:minor" device-number. - : Starting sector within the device. - - -Example scripts -=============== -[[ -#!/bin/sh -# Create an identity mapping for a device -echo "0 `blockdev --getsz $1` linear $1 0" | dmsetup create identity -]] - - -[[ -#!/bin/sh -# Join 2 devices together -size1=`blockdev --getsz $1` -size2=`blockdev --getsz $2` -echo "0 $size1 linear $1 0 -$size1 $size2 linear $2 0" | dmsetup create joined -]] - - -[[ -#!/usr/bin/perl -w -# Split a device into 4M chunks and then join them together in reverse order. - -my $name = "reverse"; -my $extent_size = 4 * 1024 * 2; -my $dev = $ARGV[0]; -my $table = ""; -my $count = 0; - -if (!defined($dev)) { - die("Please specify a device.\n"); -} - -my $dev_size = `blockdev --getsz $dev`; -my $extents = int($dev_size / $extent_size) - - (($dev_size % $extent_size) ? 
1 : 0); - -while ($extents > 0) { - my $this_start = $count * $extent_size; - $extents--; - $count++; - my $this_offset = $extents * $extent_size; - - $table .= "$this_start $extent_size linear $dev $this_offset\n"; -} - -`echo \"$table\" | dmsetup create $name`; -]] diff --git a/Documentation/device-mapper/log-writes.rst b/Documentation/device-mapper/log-writes.rst new file mode 100644 index 000000000000..23141f2ffb7c --- /dev/null +++ b/Documentation/device-mapper/log-writes.rst @@ -0,0 +1,145 @@ +============= +dm-log-writes +============= + +This target takes 2 devices, one to pass all IO to normally, and one to log all +of the write operations to. This is intended for file system developers wishing +to verify the integrity of metadata or data as the file system is written to. +There is a log_write_entry written for every WRITE request and the target is +able to take arbitrary data from userspace to insert into the log. The data +that is in the WRITE requests is copied into the log to make the replay happen +exactly as it happened originally. + +Log Ordering +============ + +We log things in order of completion once we are sure the write is no longer in +cache. This means that normal WRITE requests are not actually logged until the +next REQ_PREFLUSH request. This is to make it easier for userspace to replay +the log in a way that correlates to what is on disk and not what is in cache, +to make it easier to detect improper waiting/flushing. + +This works by attaching all WRITE requests to a list once the write completes. +Once we see a REQ_PREFLUSH request we splice this list onto the request and once +the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only +completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to +simulate the worst case scenario with regard to power failures. 
Consider the +following example (W means write, C means complete): + + W1,W2,W3,C3,C2,Wflush,C1,Cflush + +The log would show the following: + + W3,W2,flush,W1.... + +Again this is to simulate what is actually on disk, this allows us to detect +cases where a power failure at a particular point in time would create an +inconsistent file system. + +Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as +they complete as those requests will obviously bypass the device cache. + +Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would +have all the DISCARD requests, and then the WRITE requests and then the FLUSH +request. Consider the following example: + + WRITE block 1, DISCARD block 1, FLUSH + +If we logged DISCARD when it completed, the replay would look like this: + + DISCARD 1, WRITE 1, FLUSH + +which isn't quite what happened and wouldn't be caught during the log replay. + +Target interface +================ + +i) Constructor + + log-writes + + ============= ============================================== + dev_path Device that all of the IO will go to normally. + log_dev_path Device where the log entries are written to. + ============= ============================================== + +ii) Status + + <#logged entries> + + =========================== ======================== + #logged entries Number of logged entries + highest allocated sector Highest allocated sector + =========================== ======================== + +iii) Messages + + mark + + You can use a dmsetup message to set an arbitrary mark in a log. 
+ For example say you want to fsck a file system after every + write, but first you need to replay up to the mkfs to make sure + we're fsck'ing something reasonable, you would do something like + this:: + + mkfs.btrfs -f /dev/mapper/log + dmsetup message log 0 mark mkfs + + + This would allow you to replay the log up to the mkfs mark and + then replay from that point on doing the fsck check in the + interval that you want. + + Every log has a mark at the end labeled "dm-log-writes-end". + +Userspace component +=================== + +There is a userspace tool that will replay the log for you in various ways. +It can be found here: https://github.com/josefbacik/log-writes + +Example usage +============= + +Say you want to test fsync on your file system. You would do something like +this:: + + TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" + dmsetup create log --table "$TABLE" + mkfs.btrfs -f /dev/mapper/log + dmsetup message log 0 mark mkfs + + mount /dev/mapper/log /mnt/btrfs-test + + dmsetup message log 0 mark fsync + md5sum /mnt/btrfs-test/foo + umount /mnt/btrfs-test + + dmsetup remove log + replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync + mount /dev/sdb /mnt/btrfs-test + md5sum /mnt/btrfs-test/foo + + + Another option is to do a complicated file system operation and verify the file + system is consistent during the entire operation. 
You could do this with: + + TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" + dmsetup create log --table "$TABLE" + mkfs.btrfs -f /dev/mapper/log + dmsetup message log 0 mark mkfs + + mount /dev/mapper/log /mnt/btrfs-test + + btrfs filesystem balance /mnt/btrfs-test + umount /mnt/btrfs-test + dmsetup remove log + + replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs + btrfsck /dev/sdb + replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \ + --fsck "btrfsck /dev/sdb" --check fua + +And that will replay the log until it sees a FUA request, run the fsck command +and if the fsck passes it will replay to the next FUA, until it is completed or +the fsck command exists abnormally. diff --git a/Documentation/device-mapper/log-writes.txt b/Documentation/device-mapper/log-writes.txt deleted file mode 100644 index b638d124be6a..000000000000 --- a/Documentation/device-mapper/log-writes.txt +++ /dev/null @@ -1,140 +0,0 @@ -dm-log-writes -============= - -This target takes 2 devices, one to pass all IO to normally, and one to log all -of the write operations to. This is intended for file system developers wishing -to verify the integrity of metadata or data as the file system is written to. -There is a log_write_entry written for every WRITE request and the target is -able to take arbitrary data from userspace to insert into the log. The data -that is in the WRITE requests is copied into the log to make the replay happen -exactly as it happened originally. - -Log Ordering -============ - -We log things in order of completion once we are sure the write is no longer in -cache. This means that normal WRITE requests are not actually logged until the -next REQ_PREFLUSH request. This is to make it easier for userspace to replay -the log in a way that correlates to what is on disk and not what is in cache, -to make it easier to detect improper waiting/flushing. - -This works by attaching all WRITE requests to a list once the write completes. 
-Once we see a REQ_PREFLUSH request we splice this list onto the request and once -the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only -completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to -simulate the worst case scenario with regard to power failures. Consider the -following example (W means write, C means complete): - -W1,W2,W3,C3,C2,Wflush,C1,Cflush - -The log would show the following - -W3,W2,flush,W1.... - -Again this is to simulate what is actually on disk, this allows us to detect -cases where a power failure at a particular point in time would create an -inconsistent file system. - -Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as -they complete as those requests will obviously bypass the device cache. - -Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would -have all the DISCARD requests, and then the WRITE requests and then the FLUSH -request. Consider the following example: - -WRITE block 1, DISCARD block 1, FLUSH - -If we logged DISCARD when it completed, the replay would look like this - -DISCARD 1, WRITE 1, FLUSH - -which isn't quite what happened and wouldn't be caught during the log replay. - -Target interface -================ - -i) Constructor - - log-writes - - dev_path : Device that all of the IO will go to normally. - log_dev_path : Device where the log entries are written to. - -ii) Status - - <#logged entries> - - #logged entries : Number of logged entries - highest allocated sector : Highest allocated sector - -iii) Messages - - mark - - You can use a dmsetup message to set an arbitrary mark in a log. 
- For example say you want to fsck a file system after every - write, but first you need to replay up to the mkfs to make sure - we're fsck'ing something reasonable, you would do something like - this: - - mkfs.btrfs -f /dev/mapper/log - dmsetup message log 0 mark mkfs - - - This would allow you to replay the log up to the mkfs mark and - then replay from that point on doing the fsck check in the - interval that you want. - - Every log has a mark at the end labeled "dm-log-writes-end". - -Userspace component -=================== - -There is a userspace tool that will replay the log for you in various ways. -It can be found here: https://github.com/josefbacik/log-writes - -Example usage -============= - -Say you want to test fsync on your file system. You would do something like -this: - -TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" -dmsetup create log --table "$TABLE" -mkfs.btrfs -f /dev/mapper/log -dmsetup message log 0 mark mkfs - -mount /dev/mapper/log /mnt/btrfs-test - -dmsetup message log 0 mark fsync -md5sum /mnt/btrfs-test/foo -umount /mnt/btrfs-test - -dmsetup remove log -replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync -mount /dev/sdb /mnt/btrfs-test -md5sum /mnt/btrfs-test/foo - - -Another option is to do a complicated file system operation and verify the file -system is consistent during the entire operation. 
You could do this with: - -TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" -dmsetup create log --table "$TABLE" -mkfs.btrfs -f /dev/mapper/log -dmsetup message log 0 mark mkfs - -mount /dev/mapper/log /mnt/btrfs-test - -btrfs filesystem balance /mnt/btrfs-test -umount /mnt/btrfs-test -dmsetup remove log - -replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs -btrfsck /dev/sdb -replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \ - --fsck "btrfsck /dev/sdb" --check fua - -And that will replay the log until it sees a FUA request, run the fsck command -and if the fsck passes it will replay to the next FUA, until it is completed or -the fsck command exists abnormally. diff --git a/Documentation/device-mapper/persistent-data.rst b/Documentation/device-mapper/persistent-data.rst new file mode 100644 index 000000000000..2065c3c5a091 --- /dev/null +++ b/Documentation/device-mapper/persistent-data.rst @@ -0,0 +1,88 @@ +=============== +Persistent data +=============== + +Introduction +============ + +The more-sophisticated device-mapper targets require complex metadata +that is managed in kernel. In late 2010 we were seeing that various +different targets were rolling their own data structures, for example: + +- Mikulas Patocka's multisnap implementation +- Heinz Mauelshagen's thin provisioning target +- Another btree-based caching target posted to dm-devel +- Another multi-snapshot target based on a design of Daniel Phillips + +Maintaining these data structures takes a lot of work, so if possible +we'd like to reduce the number. + +The persistent-data library is an attempt to provide a re-usable +framework for people who want to store metadata in device-mapper +targets. It's currently used by the thin-provisioning target and an +upcoming hierarchical storage target. + +Overview +======== + +The main documentation is in the header files which can all be found +under drivers/md/persistent-data. 
+ +The block manager +----------------- + +dm-block-manager.[hc] + +This provides access to the data on disk in fixed sized-blocks. There +is a read/write locking interface to prevent concurrent accesses, and +keep data that is being used in the cache. + +Clients of persistent-data are unlikely to use this directly. + +The transaction manager +----------------------- + +dm-transaction-manager.[hc] + +This restricts access to blocks and enforces copy-on-write semantics. +The only way you can get hold of a writable block through the +transaction manager is by shadowing an existing block (ie. doing +copy-on-write) or allocating a fresh one. Shadowing is elided within +the same transaction so performance is reasonable. The commit method +ensures that all data is flushed before it writes the superblock. +On power failure your metadata will be as it was when last committed. + +The Space Maps +-------------- + +dm-space-map.h +dm-space-map-metadata.[hc] +dm-space-map-disk.[hc] + +On-disk data structures that keep track of reference counts of blocks. +Also acts as the allocator of new blocks. Currently two +implementations: a simpler one for managing blocks on a different +device (eg. thinly-provisioned data blocks); and one for managing +the metadata space. The latter is complicated by the need to store +its own data within the space it's managing. + +The data structures +------------------- + +dm-btree.[hc] +dm-btree-remove.c +dm-btree-spine.c +dm-btree-internal.h + +Currently there is only one data structure, a hierarchical btree. +There are plans to add more. For example, something with an +array-like interface would see a lot of use. + +The btree is 'hierarchical' in that you can define it to be composed +of nested btrees, and take multiple keys. For example, the +thin-provisioning target uses a btree with two levels of nesting. +The first maps a device id to a mapping tree, and that in turn maps a +virtual block to a physical block. 
+ +Values stored in the btrees can have arbitrary size. Keys are always +64bits, although nesting allows you to use multiple keys. diff --git a/Documentation/device-mapper/persistent-data.txt b/Documentation/device-mapper/persistent-data.txt deleted file mode 100644 index a333bcb3a6c2..000000000000 --- a/Documentation/device-mapper/persistent-data.txt +++ /dev/null @@ -1,84 +0,0 @@ -Introduction -============ - -The more-sophisticated device-mapper targets require complex metadata -that is managed in kernel. In late 2010 we were seeing that various -different targets were rolling their own data structures, for example: - -- Mikulas Patocka's multisnap implementation -- Heinz Mauelshagen's thin provisioning target -- Another btree-based caching target posted to dm-devel -- Another multi-snapshot target based on a design of Daniel Phillips - -Maintaining these data structures takes a lot of work, so if possible -we'd like to reduce the number. - -The persistent-data library is an attempt to provide a re-usable -framework for people who want to store metadata in device-mapper -targets. It's currently used by the thin-provisioning target and an -upcoming hierarchical storage target. - -Overview -======== - -The main documentation is in the header files which can all be found -under drivers/md/persistent-data. - -The block manager ------------------ - -dm-block-manager.[hc] - -This provides access to the data on disk in fixed sized-blocks. There -is a read/write locking interface to prevent concurrent accesses, and -keep data that is being used in the cache. - -Clients of persistent-data are unlikely to use this directly. - -The transaction manager ------------------------ - -dm-transaction-manager.[hc] - -This restricts access to blocks and enforces copy-on-write semantics. -The only way you can get hold of a writable block through the -transaction manager is by shadowing an existing block (ie. doing -copy-on-write) or allocating a fresh one. 
Shadowing is elided within -the same transaction so performance is reasonable. The commit method -ensures that all data is flushed before it writes the superblock. -On power failure your metadata will be as it was when last committed. - -The Space Maps --------------- - -dm-space-map.h -dm-space-map-metadata.[hc] -dm-space-map-disk.[hc] - -On-disk data structures that keep track of reference counts of blocks. -Also acts as the allocator of new blocks. Currently two -implementations: a simpler one for managing blocks on a different -device (eg. thinly-provisioned data blocks); and one for managing -the metadata space. The latter is complicated by the need to store -its own data within the space it's managing. - -The data structures -------------------- - -dm-btree.[hc] -dm-btree-remove.c -dm-btree-spine.c -dm-btree-internal.h - -Currently there is only one data structure, a hierarchical btree. -There are plans to add more. For example, something with an -array-like interface would see a lot of use. - -The btree is 'hierarchical' in that you can define it to be composed -of nested btrees, and take multiple keys. For example, the -thin-provisioning target uses a btree with two levels of nesting. -The first maps a device id to a mapping tree, and that in turn maps a -virtual block to a physical block. - -Values stored in the btrees can have arbitrary size. Keys are always -64bits, although nesting allows you to use multiple keys. diff --git a/Documentation/device-mapper/snapshot.rst b/Documentation/device-mapper/snapshot.rst new file mode 100644 index 000000000000..4c53304e72f1 --- /dev/null +++ b/Documentation/device-mapper/snapshot.rst @@ -0,0 +1,180 @@ +============================== +Device-mapper snapshot support +============================== + +Device-mapper allows you, without massive data copying: + +- To create snapshots of any block device i.e. 
mountable, saved states of + the block device which are also writable without interfering with the + original content; +- To create device "forks", i.e. multiple different versions of the + same data stream. +- To merge a snapshot of a block device back into the snapshot's origin + device. + +In the first two cases, dm copies only the chunks of data that get +changed and uses a separate copy-on-write (COW) block device for +storage. + +For snapshot merge the contents of the COW storage are merged back into +the origin device. + + +There are three dm targets available: +snapshot, snapshot-origin, and snapshot-merge. + +- snapshot-origin + +which will normally have one or more snapshots based on it. +Reads will be mapped directly to the backing device. For each write, the +original data will be saved in the of each snapshot to keep +its visible content unchanged, at least until the fills up. + + +- snapshot + +A snapshot of the block device is created. Changed chunks of + sectors will be stored on the . Writes will +only go to the . Reads will come from the or +from for unchanged data. will often be +smaller than the origin and if it fills up the snapshot will become +useless and be disabled, returning errors. So it is important to monitor +the amount of free space and expand the before it fills up. + + is P (Persistent) or N (Not persistent - will not survive +after reboot). O (Overflow) can be added as a persistent store option +to allow userspace to advertise its support for seeing "Overflow" in the +snapshot status. So supported store types are "P", "PO" and "N". + +The difference between persistent and transient is with transient +snapshots less metadata must be saved on disk - they can be kept in +memory by the kernel. + +When loading or unloading the snapshot target, the corresponding +snapshot-origin or snapshot-merge target must be suspended. A failure to +suspend the origin target could result in data corruption. 
+ + +* snapshot-merge + +takes the same table arguments as the snapshot target except it only +works with persistent snapshots. This target assumes the role of the +"snapshot-origin" target and must not be loaded if the "snapshot-origin" +is still present for . + +Creates a merging snapshot that takes control of the changed chunks +stored in the of an existing snapshot, through a handover +procedure, and merges these chunks back into the . Once merging +has started (in the background) the may be opened and the merge +will continue while I/O is flowing to it. Changes to the are +deferred until the merging snapshot's corresponding chunk(s) have been +merged. Once merging has started the snapshot device, associated with +the "snapshot" target, will return -EIO when accessed. + + +How snapshot is used by LVM2 +============================ +When you create the first LVM2 snapshot of a volume, four dm devices are used: + +1) a device containing the original mapping table of the source volume; +2) a device used as the ; +3) a "snapshot" device, combining #1 and #2, which is the visible snapshot + volume; +4) the "original" volume (which uses the device number used by the original + source volume), whose table is replaced by a "snapshot-origin" mapping + from device #1. 
+ +A fixed naming scheme is used, so with the following commands:: + + lvcreate -L 1G -n base volumeGroup + lvcreate -L 100M --snapshot -n snap volumeGroup/base + +we'll have this situation (with volumes in above order):: + + # dmsetup table|grep volumeGroup + + volumeGroup-base-real: 0 2097152 linear 8:19 384 + volumeGroup-snap-cow: 0 204800 linear 8:19 2097536 + volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16 + volumeGroup-base: 0 2097152 snapshot-origin 254:11 + + # ls -lL /dev/mapper/volumeGroup-* + brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real + brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow + brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap + brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base + + +How snapshot-merge is used by LVM2 +================================== +A merging snapshot assumes the role of the "snapshot-origin" while +merging. As such the "snapshot-origin" is replaced with +"snapshot-merge". The "-real" device is not changed and the "-cow" +device is renamed to -cow to aid LVM2's cleanup of the +merging snapshot after it completes. The "snapshot" that hands over its +COW device to the "snapshot-merge" is deactivated (unless using lvchange +--refresh); but if it is left active it will simply return I/O errors. 
+ +A snapshot will merge into its origin with the following command:: + + lvconvert --merge volumeGroup/snap + +we'll now have this situation:: + + # dmsetup table|grep volumeGroup + + volumeGroup-base-real: 0 2097152 linear 8:19 384 + volumeGroup-base-cow: 0 204800 linear 8:19 2097536 + volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16 + + # ls -lL /dev/mapper/volumeGroup-* + brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real + brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow + brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base + + +How to determine when a merging is complete +=========================================== +The snapshot-merge and snapshot status lines end with: + + / + +Both and include both data and metadata. +During merging, the number of sectors allocated gets smaller and +smaller. Merging has finished when the number of sectors holding data +is zero, in other words == . + +Here is a practical example (using a hybrid of lvm and dmsetup commands):: + + # lvs + LV VG Attr LSize Origin Snap% Move Log Copy% Convert + base volumeGroup owi-a- 4.00g + snap volumeGroup swi-a- 1.00g base 18.97 + + # dmsetup status volumeGroup-snap + 0 8388608 snapshot 397896/2097152 1560 + ^^^^ metadata sectors + + # lvconvert --merge -b volumeGroup/snap + Merging of volume snap started. + + # lvs volumeGroup/snap + LV VG Attr LSize Origin Snap% Move Log Copy% Convert + base volumeGroup Owi-a- 4.00g 17.23 + + # dmsetup status volumeGroup-base + 0 8388608 snapshot-merge 281688/2097152 1104 + + # dmsetup status volumeGroup-base + 0 8388608 snapshot-merge 180480/2097152 712 + + # dmsetup status volumeGroup-base + 0 8388608 snapshot-merge 16/2097152 16 + +Merging has finished. 
+ +:: + + # lvs + LV VG Attr LSize Origin Snap% Move Log Copy% Convert + base volumeGroup owi-a- 4.00g diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt deleted file mode 100644 index b8bbb516f989..000000000000 --- a/Documentation/device-mapper/snapshot.txt +++ /dev/null @@ -1,176 +0,0 @@ -Device-mapper snapshot support -============================== - -Device-mapper allows you, without massive data copying: - -*) To create snapshots of any block device i.e. mountable, saved states of -the block device which are also writable without interfering with the -original content; -*) To create device "forks", i.e. multiple different versions of the -same data stream. -*) To merge a snapshot of a block device back into the snapshot's origin -device. - -In the first two cases, dm copies only the chunks of data that get -changed and uses a separate copy-on-write (COW) block device for -storage. - -For snapshot merge the contents of the COW storage are merged back into -the origin device. - - -There are three dm targets available: -snapshot, snapshot-origin, and snapshot-merge. - -*) snapshot-origin - -which will normally have one or more snapshots based on it. -Reads will be mapped directly to the backing device. For each write, the -original data will be saved in the of each snapshot to keep -its visible content unchanged, at least until the fills up. - - -*) snapshot - -A snapshot of the block device is created. Changed chunks of - sectors will be stored on the . Writes will -only go to the . Reads will come from the or -from for unchanged data. will often be -smaller than the origin and if it fills up the snapshot will become -useless and be disabled, returning errors. So it is important to monitor -the amount of free space and expand the before it fills up. - - is P (Persistent) or N (Not persistent - will not survive -after reboot). 
O (Overflow) can be added as a persistent store option -to allow userspace to advertise its support for seeing "Overflow" in the -snapshot status. So supported store types are "P", "PO" and "N". - -The difference between persistent and transient is with transient -snapshots less metadata must be saved on disk - they can be kept in -memory by the kernel. - -When loading or unloading the snapshot target, the corresponding -snapshot-origin or snapshot-merge target must be suspended. A failure to -suspend the origin target could result in data corruption. - - -* snapshot-merge - -takes the same table arguments as the snapshot target except it only -works with persistent snapshots. This target assumes the role of the -"snapshot-origin" target and must not be loaded if the "snapshot-origin" -is still present for . - -Creates a merging snapshot that takes control of the changed chunks -stored in the of an existing snapshot, through a handover -procedure, and merges these chunks back into the . Once merging -has started (in the background) the may be opened and the merge -will continue while I/O is flowing to it. Changes to the are -deferred until the merging snapshot's corresponding chunk(s) have been -merged. Once merging has started the snapshot device, associated with -the "snapshot" target, will return -EIO when accessed. - - -How snapshot is used by LVM2 -============================ -When you create the first LVM2 snapshot of a volume, four dm devices are used: - -1) a device containing the original mapping table of the source volume; -2) a device used as the ; -3) a "snapshot" device, combining #1 and #2, which is the visible snapshot - volume; -4) the "original" volume (which uses the device number used by the original - source volume), whose table is replaced by a "snapshot-origin" mapping - from device #1. 
- -A fixed naming scheme is used, so with the following commands: - -lvcreate -L 1G -n base volumeGroup -lvcreate -L 100M --snapshot -n snap volumeGroup/base - -we'll have this situation (with volumes in above order): - -# dmsetup table|grep volumeGroup - -volumeGroup-base-real: 0 2097152 linear 8:19 384 -volumeGroup-snap-cow: 0 204800 linear 8:19 2097536 -volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16 -volumeGroup-base: 0 2097152 snapshot-origin 254:11 - -# ls -lL /dev/mapper/volumeGroup-* -brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real -brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow -brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap -brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base - - -How snapshot-merge is used by LVM2 -================================== -A merging snapshot assumes the role of the "snapshot-origin" while -merging. As such the "snapshot-origin" is replaced with -"snapshot-merge". The "-real" device is not changed and the "-cow" -device is renamed to -cow to aid LVM2's cleanup of the -merging snapshot after it completes. The "snapshot" that hands over its -COW device to the "snapshot-merge" is deactivated (unless using lvchange ---refresh); but if it is left active it will simply return I/O errors. 
- -A snapshot will merge into its origin with the following command: - -lvconvert --merge volumeGroup/snap - -we'll now have this situation: - -# dmsetup table|grep volumeGroup - -volumeGroup-base-real: 0 2097152 linear 8:19 384 -volumeGroup-base-cow: 0 204800 linear 8:19 2097536 -volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16 - -# ls -lL /dev/mapper/volumeGroup-* -brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real -brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow -brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base - - -How to determine when a merging is complete -=========================================== -The snapshot-merge and snapshot status lines end with: - / - -Both and include both data and metadata. -During merging, the number of sectors allocated gets smaller and -smaller. Merging has finished when the number of sectors holding data -is zero, in other words == . - -Here is a practical example (using a hybrid of lvm and dmsetup commands): - -# lvs - LV VG Attr LSize Origin Snap% Move Log Copy% Convert - base volumeGroup owi-a- 4.00g - snap volumeGroup swi-a- 1.00g base 18.97 - -# dmsetup status volumeGroup-snap -0 8388608 snapshot 397896/2097152 1560 - ^^^^ metadata sectors - -# lvconvert --merge -b volumeGroup/snap - Merging of volume snap started. - -# lvs volumeGroup/snap - LV VG Attr LSize Origin Snap% Move Log Copy% Convert - base volumeGroup Owi-a- 4.00g 17.23 - -# dmsetup status volumeGroup-base -0 8388608 snapshot-merge 281688/2097152 1104 - -# dmsetup status volumeGroup-base -0 8388608 snapshot-merge 180480/2097152 712 - -# dmsetup status volumeGroup-base -0 8388608 snapshot-merge 16/2097152 16 - -Merging has finished. 
- -# lvs - LV VG Attr LSize Origin Snap% Move Log Copy% Convert - base volumeGroup owi-a- 4.00g diff --git a/Documentation/device-mapper/statistics.rst b/Documentation/device-mapper/statistics.rst new file mode 100644 index 000000000000..3d80a9f850cc --- /dev/null +++ b/Documentation/device-mapper/statistics.rst @@ -0,0 +1,225 @@ +============= +DM statistics +============= + +Device Mapper supports the collection of I/O statistics on user-defined +regions of a DM device. If no regions are defined no statistics are +collected so there isn't any performance impact. Only bio-based DM +devices are currently supported. + +Each user-defined region specifies a starting sector, length and step. +Individual statistics will be collected for each step-sized area within +the range specified. + +The I/O statistics counters for each step-sized area of a region are +in the same format as `/sys/block/*/stat` or `/proc/diskstats` (see: +Documentation/iostats.txt). But two extra counters (12 and 13) are +provided: total time spent reading and writing. When the histogram +argument is used, the 14th parameter is reported that represents the +histogram of latencies. All these counters may be accessed by sending +the @stats_print message to the appropriate DM device via dmsetup. + +The reported times are in milliseconds and the granularity depends on +the kernel ticks. When the option precise_timestamps is used, the +reported times are in nanoseconds. + +Each region has a corresponding unique identifier, which we call a +region_id, that is assigned when the region is created. The region_id +must be supplied when querying statistics about the region, deleting the +region, etc. Unique region_ids enable multiple userspace programs to +request and process statistics for the same DM device without stepping +on each other's data. + +The creation of DM statistics will allocate memory via kmalloc or +fallback to using vmalloc space. 
At most, 1/4 of the overall system +memory may be allocated by DM statistics. The admin can see how much +memory is used by reading: + + /sys/module/dm_mod/parameters/stats_current_allocated_bytes + +Messages +======== + + @stats_create [ ...] [ []] + Create a new region and return the region_id. + + + "-" + whole device + "+" + a range of 512-byte sectors + starting with . + + + "" + the range is subdivided into areas each containing + sectors. + "/" + the range is subdivided into the specified + number of areas. + + + The number of optional arguments + + + The following optional arguments are supported: + + precise_timestamps + use precise timer with nanosecond resolution + instead of the "jiffies" variable. When this argument is + used, the resulting times are in nanoseconds instead of + milliseconds. Precise timestamps are a little bit slower + to obtain than jiffies-based timestamps. + histogram:n1,n2,n3,n4,... + collect histogram of latencies. The + numbers n1, n2, etc are times that represent the boundaries + of the histogram. If precise_timestamps is not used, the + times are in milliseconds, otherwise they are in + nanoseconds. For each range, the kernel will report the + number of requests that completed within this range. For + example, if we use "histogram:10,20,30", the kernel will + report four numbers a:b:c:d. a is the number of requests + that took 0-10 ms to complete, b is the number of requests + that took 10-20 ms to complete, c is the number of requests + that took 20-30 ms to complete and d is the number of + requests that took more than 30 ms to complete. + + + An optional parameter. A name that uniquely identifies + the userspace owner of the range. This groups ranges together + so that userspace programs can identify the ranges they + created and ignore those created by others. + The kernel returns this string back in the output of + @stats_list message, but it doesn't use it for anything else. 
+ If we omit the number of optional arguments, program id must not + be a number, otherwise it would be interpreted as the number of + optional arguments. + + + An optional parameter. A word that provides auxiliary data + that is useful to the client program that created the range. + The kernel returns this string back in the output of + @stats_list message, but it doesn't use this value for anything. + + @stats_delete + Delete the region with the specified id. + + + region_id returned from @stats_create + + @stats_clear + Clear all the counters except the in-flight i/o counters. + + + region_id returned from @stats_create + + @stats_list [] + List all regions registered with @stats_create. + + + An optional parameter. + If this parameter is specified, only matching regions + are returned. + If it is not specified, all regions are returned. + + Output format: + : + + precise_timestamps histogram:n1,n2,n3,... + + The strings "precise_timestamps" and "histogram" are printed only + if they were specified when creating the region. + + @stats_print [ ] + Print counters for each step-sized area of a region. + + + region_id returned from @stats_create + + + The index of the starting line in the output. + If omitted, all lines are returned. + + + The number of lines to include in the output. + If omitted, all lines are returned. + + Output format for each step-sized area of a region: + + + + counters + + The first 11 counters have the same meaning as + `/sys/block/*/stat or /proc/diskstats`. + + Please refer to Documentation/iostats.txt for details. + + 1. the number of reads completed + 2. the number of reads merged + 3. the number of sectors read + 4. the number of milliseconds spent reading + 5. the number of writes completed + 6. the number of writes merged + 7. the number of sectors written + 8. the number of milliseconds spent writing + 9. the number of I/Os currently in progress + 10. the number of milliseconds spent doing I/Os + 11. 
the weighted number of milliseconds spent doing I/Os + + Additional counters: + + 12. the total time spent reading in milliseconds + 13. the total time spent writing in milliseconds + + @stats_print_clear [ ] + Atomically print and then clear all the counters except the + in-flight i/o counters. Useful when the client consuming the + statistics does not want to lose any statistics (those updated + between printing and clearing). + + + region_id returned from @stats_create + + + The index of the starting line in the output. + If omitted, all lines are printed and then cleared. + + + The number of lines to process. + If omitted, all lines are printed and then cleared. + + @stats_set_aux + Store auxiliary data aux_data for the specified region. + + + region_id returned from @stats_create + + + The string that identifies data which is useful to the client + program that created the range. The kernel returns this + string back in the output of @stats_list message, but it + doesn't use this value for anything. + +Examples +======== + +Subdivide the DM device 'vol' into 100 pieces and start collecting +statistics on them:: + + dmsetup message vol 0 @stats_create - /100 + +Set the auxiliary data string to "foo bar baz" (the escape for each +space must also be escaped, otherwise the shell will consume them):: + + dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz + +List the statistics:: + + dmsetup message vol 0 @stats_list + +Print the statistics:: + + dmsetup message vol 0 @stats_print 0 + +Delete the statistics:: + + dmsetup message vol 0 @stats_delete 0 diff --git a/Documentation/device-mapper/statistics.txt b/Documentation/device-mapper/statistics.txt deleted file mode 100644 index 170ac02a1f50..000000000000 --- a/Documentation/device-mapper/statistics.txt +++ /dev/null @@ -1,223 +0,0 @@ -DM statistics -============= - -Device Mapper supports the collection of I/O statistics on user-defined -regions of a DM device. 
If no regions are defined no statistics are -collected so there isn't any performance impact. Only bio-based DM -devices are currently supported. - -Each user-defined region specifies a starting sector, length and step. -Individual statistics will be collected for each step-sized area within -the range specified. - -The I/O statistics counters for each step-sized area of a region are -in the same format as /sys/block/*/stat or /proc/diskstats (see: -Documentation/iostats.txt). But two extra counters (12 and 13) are -provided: total time spent reading and writing. When the histogram -argument is used, the 14th parameter is reported that represents the -histogram of latencies. All these counters may be accessed by sending -the @stats_print message to the appropriate DM device via dmsetup. - -The reported times are in milliseconds and the granularity depends on -the kernel ticks. When the option precise_timestamps is used, the -reported times are in nanoseconds. - -Each region has a corresponding unique identifier, which we call a -region_id, that is assigned when the region is created. The region_id -must be supplied when querying statistics about the region, deleting the -region, etc. Unique region_ids enable multiple userspace programs to -request and process statistics for the same DM device without stepping -on each other's data. - -The creation of DM statistics will allocate memory via kmalloc or -fallback to using vmalloc space. At most, 1/4 of the overall system -memory may be allocated by DM statistics. The admin can see how much -memory is used by reading -/sys/module/dm_mod/parameters/stats_current_allocated_bytes - -Messages -======== - - @stats_create - [ ...] - [ []] - - Create a new region and return the region_id. - - - "-" - whole device - "+" - a range of 512-byte sectors - starting with . - - - "" - the range is subdivided into areas each containing - sectors. - "/" - the range is subdivided into the specified - number of areas. 
- - - The number of optional arguments - - - The following optional arguments are supported - precise_timestamps - use precise timer with nanosecond resolution - instead of the "jiffies" variable. When this argument is - used, the resulting times are in nanoseconds instead of - milliseconds. Precise timestamps are a little bit slower - to obtain than jiffies-based timestamps. - histogram:n1,n2,n3,n4,... - collect histogram of latencies. The - numbers n1, n2, etc are times that represent the boundaries - of the histogram. If precise_timestamps is not used, the - times are in milliseconds, otherwise they are in - nanoseconds. For each range, the kernel will report the - number of requests that completed within this range. For - example, if we use "histogram:10,20,30", the kernel will - report four numbers a:b:c:d. a is the number of requests - that took 0-10 ms to complete, b is the number of requests - that took 10-20 ms to complete, c is the number of requests - that took 20-30 ms to complete and d is the number of - requests that took more than 30 ms to complete. - - - An optional parameter. A name that uniquely identifies - the userspace owner of the range. This groups ranges together - so that userspace programs can identify the ranges they - created and ignore those created by others. - The kernel returns this string back in the output of - @stats_list message, but it doesn't use it for anything else. - If we omit the number of optional arguments, program id must not - be a number, otherwise it would be interpreted as the number of - optional arguments. - - - An optional parameter. A word that provides auxiliary data - that is useful to the client program that created the range. - The kernel returns this string back in the output of - @stats_list message, but it doesn't use this value for anything. - - @stats_delete - - Delete the region with the specified id. 
- - - region_id returned from @stats_create - - @stats_clear - - Clear all the counters except the in-flight i/o counters. - - - region_id returned from @stats_create - - @stats_list [] - - List all regions registered with @stats_create. - - - An optional parameter. - If this parameter is specified, only matching regions - are returned. - If it is not specified, all regions are returned. - - Output format: - : + - precise_timestamps histogram:n1,n2,n3,... - - The strings "precise_timestamps" and "histogram" are printed only - if they were specified when creating the region. - - @stats_print [ ] - - Print counters for each step-sized area of a region. - - - region_id returned from @stats_create - - - The index of the starting line in the output. - If omitted, all lines are returned. - - - The number of lines to include in the output. - If omitted, all lines are returned. - - Output format for each step-sized area of a region: - - + counters - - The first 11 counters have the same meaning as - /sys/block/*/stat or /proc/diskstats. - - Please refer to Documentation/iostats.txt for details. - - 1. the number of reads completed - 2. the number of reads merged - 3. the number of sectors read - 4. the number of milliseconds spent reading - 5. the number of writes completed - 6. the number of writes merged - 7. the number of sectors written - 8. the number of milliseconds spent writing - 9. the number of I/Os currently in progress - 10. the number of milliseconds spent doing I/Os - 11. the weighted number of milliseconds spent doing I/Os - - Additional counters: - 12. the total time spent reading in milliseconds - 13. the total time spent writing in milliseconds - - @stats_print_clear [ ] - - Atomically print and then clear all the counters except the - in-flight i/o counters. Useful when the client consuming the - statistics does not want to lose any statistics (those updated - between printing and clearing). 
- - - region_id returned from @stats_create - - - The index of the starting line in the output. - If omitted, all lines are printed and then cleared. - - - The number of lines to process. - If omitted, all lines are printed and then cleared. - - @stats_set_aux - - Store auxiliary data aux_data for the specified region. - - - region_id returned from @stats_create - - - The string that identifies data which is useful to the client - program that created the range. The kernel returns this - string back in the output of @stats_list message, but it - doesn't use this value for anything. - -Examples -======== - -Subdivide the DM device 'vol' into 100 pieces and start collecting -statistics on them: - - dmsetup message vol 0 @stats_create - /100 - -Set the auxiliary data string to "foo bar baz" (the escape for each -space must also be escaped, otherwise the shell will consume them): - - dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz - -List the statistics: - - dmsetup message vol 0 @stats_list - -Print the statistics: - - dmsetup message vol 0 @stats_print 0 - -Delete the statistics: - - dmsetup message vol 0 @stats_delete 0 diff --git a/Documentation/device-mapper/striped.rst b/Documentation/device-mapper/striped.rst new file mode 100644 index 000000000000..e9a8da192ae1 --- /dev/null +++ b/Documentation/device-mapper/striped.rst @@ -0,0 +1,61 @@ +========= +dm-stripe +========= + +Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0) +device across one or more underlying devices. Data is written in "chunks", +with consecutive chunks rotating among the underlying devices. This can +potentially provide improved I/O throughput by utilizing several physical +devices in parallel. + +Parameters: [ ]+ + : + Number of underlying devices. + : + Size of each chunk of data. Must be at least as + large as the system's PAGE_SIZE. + : + Full pathname to the underlying block-device, or a + "major:minor" device-number. 
+ : + Starting sector within the device. + +One or more underlying devices can be specified. The striped device size must +be a multiple of the chunk size multiplied by the number of underlying devices. + + +Example scripts +=============== + +:: + + #!/usr/bin/perl -w + # Create a striped device across any number of underlying devices. The device + # will be called "stripe_dev" and have a chunk-size of 128k. + + my $chunk_size = 128 * 2; + my $dev_name = "stripe_dev"; + my $num_devs = @ARGV; + my @devs = @ARGV; + my ($min_dev_size, $stripe_dev_size, $i); + + if (!$num_devs) { + die("Specify at least one device\n"); + } + + $min_dev_size = `blockdev --getsz $devs[0]`; + for ($i = 1; $i < $num_devs; $i++) { + my $this_size = `blockdev --getsz $devs[$i]`; + $min_dev_size = ($min_dev_size < $this_size) ? + $min_dev_size : $this_size; + } + + $stripe_dev_size = $min_dev_size * $num_devs; + $stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs); + + $table = "0 $stripe_dev_size striped $num_devs $chunk_size"; + for ($i = 0; $i < $num_devs; $i++) { + $table .= " $devs[$i] 0"; + } + + `echo $table | dmsetup create $dev_name`; diff --git a/Documentation/device-mapper/striped.txt b/Documentation/device-mapper/striped.txt deleted file mode 100644 index 07ec492cceee..000000000000 --- a/Documentation/device-mapper/striped.txt +++ /dev/null @@ -1,57 +0,0 @@ -dm-stripe -========= - -Device-Mapper's "striped" target is used to create a striped (i.e. RAID-0) -device across one or more underlying devices. Data is written in "chunks", -with consecutive chunks rotating among the underlying devices. This can -potentially provide improved I/O throughput by utilizing several physical -devices in parallel. - -Parameters: [ ]+ - : Number of underlying devices. - : Size of each chunk of data. Must be at least as - large as the system's PAGE_SIZE. - : Full pathname to the underlying block-device, or a - "major:minor" device-number. - : Starting sector within the device. 
- -One or more underlying devices can be specified. The striped device size must -be a multiple of the chunk size multiplied by the number of underlying devices. - - -Example scripts -=============== - -[[ -#!/usr/bin/perl -w -# Create a striped device across any number of underlying devices. The device -# will be called "stripe_dev" and have a chunk-size of 128k. - -my $chunk_size = 128 * 2; -my $dev_name = "stripe_dev"; -my $num_devs = @ARGV; -my @devs = @ARGV; -my ($min_dev_size, $stripe_dev_size, $i); - -if (!$num_devs) { - die("Specify at least one device\n"); -} - -$min_dev_size = `blockdev --getsz $devs[0]`; -for ($i = 1; $i < $num_devs; $i++) { - my $this_size = `blockdev --getsz $devs[$i]`; - $min_dev_size = ($min_dev_size < $this_size) ? - $min_dev_size : $this_size; -} - -$stripe_dev_size = $min_dev_size * $num_devs; -$stripe_dev_size -= $stripe_dev_size % ($chunk_size * $num_devs); - -$table = "0 $stripe_dev_size striped $num_devs $chunk_size"; -for ($i = 0; $i < $num_devs; $i++) { - $table .= " $devs[$i] 0"; -} - -`echo $table | dmsetup create $dev_name`; -]] - diff --git a/Documentation/device-mapper/switch.rst b/Documentation/device-mapper/switch.rst new file mode 100644 index 000000000000..7dde06be1a4f --- /dev/null +++ b/Documentation/device-mapper/switch.rst @@ -0,0 +1,141 @@ +========= +dm-switch +========= + +The device-mapper switch target creates a device that supports an +arbitrary mapping of fixed-size regions of I/O across a fixed set of +paths. The path used for any specific region can be switched +dynamically by sending the target a message. + +It maps I/O to underlying block devices efficiently when there is a large +number of fixed-sized address regions but there is no simple pattern +that would allow for a compact representation of the mapping such as +dm-stripe. + +Background +---------- + +Dell EqualLogic and some other iSCSI storage arrays use a distributed +frameless architecture. 
In this architecture, the storage group +consists of a number of distinct storage arrays ("members") each having +independent controllers, disk storage and network adapters. When a LUN +is created it is spread across multiple members. The details of the +spreading are hidden from initiators connected to this storage system. +The storage group exposes a single target discovery portal, no matter +how many members are being used. When iSCSI sessions are created, each +session is connected to an eth port on a single member. Data to a LUN +can be sent on any iSCSI session, and if the blocks being accessed are +stored on another member the I/O will be forwarded as required. This +forwarding is invisible to the initiator. The storage layout is also +dynamic, and the blocks stored on disk may be moved from member to +member as needed to balance the load. + +This architecture simplifies the management and configuration of both +the storage group and initiators. In a multipathing configuration, it +is possible to set up multiple iSCSI sessions to use multiple network +interfaces on both the host and target to take advantage of the +increased network bandwidth. An initiator could use a simple round +robin algorithm to send I/O across all paths and let the storage array +members forward it as necessary, but there is a performance advantage to +sending data directly to the correct member. + +A device-mapper table already lets you map different regions of a +device onto different targets. However in this architecture the LUN is +spread with an address region size on the order of 10s of MBs, which +means the resulting table could have more than a million entries and +consume far too much memory. + +Using this device-mapper switch target we can now build a two-layer +device hierarchy: + + Upper Tier - Determine which array member the I/O should be sent to. + Lower Tier - Load balance amongst paths to a particular member. 
+
+The lower tier consists of a single dm multipath device for each member.
+Each of these multipath devices contains the set of paths directly to
+the array member in one priority group, and leverages existing path
+selectors to load balance amongst these paths. We also build a
+non-preferred priority group containing paths to other array members for
+failover reasons.
+
+The upper tier consists of a single dm-switch device. This device uses
+a bitmap to look up the location of the I/O and choose the appropriate
+lower tier device to route the I/O. By using a bitmap we are able to
+use 4 bits for each address range in a 16 member group (which is very
+large for us). This is a much denser representation than the dm table
+b-tree can achieve.
+
+Construction Parameters
+=======================
+
+    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
+        <num_paths>
+            The number of paths across which to distribute the I/O.
+
+        <region_size>
+            The number of 512-byte sectors in a region. Each region can be redirected
+            to any of the available paths.
+
+        <num_optional_args>
+            The number of optional arguments. Currently, no optional arguments
+            are supported and so this must be zero.
+
+        <dev_path>
+            The block device that represents a specific path to the device.
+
+        <offset>
+            The offset of the start of data on the specific <dev_path> (in units
+            of 512-byte sectors). This number is added to the sector number when
+            forwarding the request to the specific path. Typically it is zero.
+
+Messages
+========
+
+set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...
+
+Modify the region table by specifying which regions are redirected to
+which paths.
+
+<index>
+    The region number (region size was specified in constructor parameters).
+    If index is omitted, the next region (previous index + 1) is used.
+    Expressed in hexadecimal (WITHOUT any prefix like 0x).
+
+<path_nr>
+    The path number in the range 0 ... (<num_paths> - 1).
+    Expressed in hexadecimal (WITHOUT any prefix like 0x).
+
+R<n>,<m>
+    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
+    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
+    slots.
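The repeat argument described above can be made concrete with a small Python sketch that expands a set_region_mappings argument list into explicit mappings. It is only an illustration of the documented semantics, not the kernel's implementation: the function names are hypothetical, the repeat argument is assumed to have the form R followed by two comma-separated hexadecimal numbers (as in the R2,10 example in this document), and the sketch only knows about mappings given in the same message, whereas a real device would read earlier entries from its existing region table.

```python
def expand_region_mappings(args):
    """Expand set_region_mappings arguments into (region, path) pairs.

    All numbers are hexadecimal without a 0x prefix. An omitted index
    means "previous index + 1". A repeat argument copies the last n
    mappings into the next m slots.
    """
    table = {}   # region index -> path number
    order = []   # region indexes in the order they were written
    index = -1
    for arg in args:
        if arg.startswith("R"):
            n_str, m_str = arg[1:].split(",")
            n, m = int(n_str, 16), int(m_str, 16)
            for _ in range(m):
                index += 1
                # Sketch limitation: the source slot must have been set
                # earlier in this same argument list.
                table[index] = table[index - n]
                order.append(index)
        else:
            idx_str, path_str = arg.split(":")
            index = int(idx_str, 16) if idx_str else index + 1
            table[index] = int(path_str, 16)
            order.append(index)
    return [(i, table[i]) for i in order]


def format_mappings(mappings):
    """Render pairs back into message arguments, omitting consecutive indexes."""
    parts = []
    prev = None
    for region, path in mappings:
        if prev is not None and region == prev + 1:
            parts.append(f":{path:x}")
        else:
            parts.append(f"{region:x}:{path:x}")
        prev = region
    return " ".join(parts)
```

For instance, `format_mappings(expand_region_mappings("1000:1 :2 R2,10".split()))` yields the two explicit mappings followed by sixteen repeated slots (hexadecimal 10), matching the long form given in the Example section.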
+ +Status +====== + +No status line is reported. + +Example +======= + +Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with +the same size. + +Create a switch device with 64kB region size:: + + dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0` + switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0" + +Set mappings for the first 7 entries to point to devices switch0, switch1, +switch2, switch0, switch1, switch2, switch1:: + + dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1 + +Set repetitive mapping. This command:: + + dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10 + +is equivalent to:: + + dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \ + :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 diff --git a/Documentation/device-mapper/switch.txt b/Documentation/device-mapper/switch.txt deleted file mode 100644 index 5bd4831db4a8..000000000000 --- a/Documentation/device-mapper/switch.txt +++ /dev/null @@ -1,138 +0,0 @@ -dm-switch -========= - -The device-mapper switch target creates a device that supports an -arbitrary mapping of fixed-size regions of I/O across a fixed set of -paths. The path used for any specific region can be switched -dynamically by sending the target a message. - -It maps I/O to underlying block devices efficiently when there is a large -number of fixed-sized address regions but there is no simple pattern -that would allow for a compact representation of the mapping such as -dm-stripe. - -Background ----------- - -Dell EqualLogic and some other iSCSI storage arrays use a distributed -frameless architecture. In this architecture, the storage group -consists of a number of distinct storage arrays ("members") each having -independent controllers, disk storage and network adapters. When a LUN -is created it is spread across multiple members. The details of the -spreading are hidden from initiators connected to this storage system. 
-The storage group exposes a single target discovery portal, no matter -how many members are being used. When iSCSI sessions are created, each -session is connected to an eth port on a single member. Data to a LUN -can be sent on any iSCSI session, and if the blocks being accessed are -stored on another member the I/O will be forwarded as required. This -forwarding is invisible to the initiator. The storage layout is also -dynamic, and the blocks stored on disk may be moved from member to -member as needed to balance the load. - -This architecture simplifies the management and configuration of both -the storage group and initiators. In a multipathing configuration, it -is possible to set up multiple iSCSI sessions to use multiple network -interfaces on both the host and target to take advantage of the -increased network bandwidth. An initiator could use a simple round -robin algorithm to send I/O across all paths and let the storage array -members forward it as necessary, but there is a performance advantage to -sending data directly to the correct member. - -A device-mapper table already lets you map different regions of a -device onto different targets. However in this architecture the LUN is -spread with an address region size on the order of 10s of MBs, which -means the resulting table could have more than a million entries and -consume far too much memory. - -Using this device-mapper switch target we can now build a two-layer -device hierarchy: - - Upper Tier - Determine which array member the I/O should be sent to. - Lower Tier - Load balance amongst paths to a particular member. - -The lower tier consists of a single dm multipath device for each member. -Each of these multipath devices contains the set of paths directly to -the array member in one priority group, and leverages existing path -selectors to load balance amongst these paths. We also build a -non-preferred priority group containing paths to other array members for -failover reasons. 
- -The upper tier consists of a single dm-switch device. This device uses -a bitmap to look up the location of the I/O and choose the appropriate -lower tier device to route the I/O. By using a bitmap we are able to -use 4 bits for each address range in a 16 member group (which is very -large for us). This is a much denser representation than the dm table -b-tree can achieve. - -Construction Parameters -======================= - - [...] - [ ]+ - - - The number of paths across which to distribute the I/O. - - - The number of 512-byte sectors in a region. Each region can be redirected - to any of the available paths. - - - The number of optional arguments. Currently, no optional arguments - are supported and so this must be zero. - - - The block device that represents a specific path to the device. - - - The offset of the start of data on the specific (in units - of 512-byte sectors). This number is added to the sector number when - forwarding the request to the specific path. Typically it is zero. - -Messages -======== - -set_region_mappings : []: []:... - -Modify the region table by specifying which regions are redirected to -which paths. - - - The region number (region size was specified in constructor parameters). - If index is omitted, the next region (previous index + 1) is used. - Expressed in hexadecimal (WITHOUT any prefix like 0x). - - - The path number in the range 0 ... ( - 1). - Expressed in hexadecimal (WITHOUT any prefix like 0x). - -R, - This parameter allows repetitive patterns to be loaded quickly. and - are hexadecimal numbers. The last mappings are repeated in the next - slots. - -Status -====== - -No status line is reported. - -Example -======= - -Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with -the same size. 
- -Create a switch device with 64kB region size: - dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0` - switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0" - -Set mappings for the first 7 entries to point to devices switch0, switch1, -switch2, switch0, switch1, switch2, switch1: - dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1 - -Set repetitive mapping. This command: - dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10 -is equivalent to: - dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \ - :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 - diff --git a/Documentation/device-mapper/thin-provisioning.rst b/Documentation/device-mapper/thin-provisioning.rst new file mode 100644 index 000000000000..bafebf79da4b --- /dev/null +++ b/Documentation/device-mapper/thin-provisioning.rst @@ -0,0 +1,427 @@ +================= +Thin provisioning +================= + +Introduction +============ + +This document describes a collection of device-mapper targets that +between them implement thin-provisioning and snapshots. + +The main highlight of this implementation, compared to the previous +implementation of snapshots, is that it allows many virtual devices to +be stored on the same data volume. This simplifies administration and +allows the sharing of data between volumes, thus reducing disk usage. + +Another significant feature is support for an arbitrary depth of +recursive snapshots (snapshots of snapshots of snapshots ...). The +previous implementation of snapshots did this by chaining together +lookup tables, and so performance was O(depth). This new +implementation uses a single data structure to avoid this degradation +with depth. Fragmentation may still be an issue, however, in some +scenarios. 
+ +Metadata is stored on a separate device from data, giving the +administrator some freedom, for example to: + +- Improve metadata resilience by storing metadata on a mirrored volume + but data on a non-mirrored one. + +- Improve performance by storing the metadata on SSD. + +Status +====== + +These targets are considered safe for production use. But different use +cases will have different performance characteristics, for example due +to fragmentation of the data volume. + +If you find this software is not performing as expected please mail +dm-devel@redhat.com with details and we'll try our best to improve +things for you. + +Userspace tools for checking and repairing the metadata have been fully +developed and are available as 'thin_check' and 'thin_repair'. The name +of the package that provides these utilities varies by distribution (on +a Red Hat distribution it is named 'device-mapper-persistent-data'). + +Cookbook +======== + +This section describes some quick recipes for using thin provisioning. +They use the dmsetup program to control the device-mapper driver +directly. End users will be advised to use a higher-level volume +manager such as LVM2 once support has been added. + +Pool device +----------- + +The pool device ties together the metadata volume and the data volume. +It maps I/O linearly to the data volume and updates the metadata via +two mechanisms: + +- Function calls from the thin targets + +- Device-mapper 'messages' from userspace which control the creation of new + virtual devices amongst other things. + +Setting up a fresh pool device +------------------------------ + +Setting up a pool device requires a valid metadata device, and a +data device. If you do not have an existing metadata device you can +make one by zeroing the first 4k to indicate empty metadata. + + dd if=/dev/zero of=$metadata_dev bs=4096 count=1 + +The amount of metadata you need will vary according to how many blocks +are shared between thin devices (i.e. 
through snapshots). If you have +less sharing than average you'll need a larger-than-average metadata device. + +As a guide, we suggest you calculate the number of bytes to use in the +metadata device as 48 * $data_dev_size / $data_block_size but round it up +to 2MB if the answer is smaller. If you're creating large numbers of +snapshots which are recording large amounts of change, you may find you +need to increase this. + +The largest size supported is 16GB: If the device is larger, +a warning will be issued and the excess space will not be used. + +Reloading a pool table +---------------------- + +You may reload a pool's table, indeed this is how the pool is resized +if it runs out of space. (N.B. While specifying a different metadata +device when reloading is not forbidden at the moment, things will go +wrong if it does not route I/O to exactly the same on-disk location as +previously.) + +Using an existing pool device +----------------------------- + +:: + + dmsetup create pool \ + --table "0 20971520 thin-pool $metadata_dev $data_dev \ + $data_block_size $low_water_mark" + +$data_block_size gives the smallest unit of disk space that can be +allocated at a time expressed in units of 512-byte sectors. +$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a +multiple of 128 (64KB). $data_block_size cannot be changed after the +thin-pool is created. People primarily interested in thin provisioning +may want to use a value such as 1024 (512KB). People doing lots of +snapshotting may want a smaller value such as 128 (64KB). If you are +not zeroing newly-allocated data, a larger $data_block_size in the +region of 256000 (128MB) is suggested. + +$low_water_mark is expressed in blocks of size $data_block_size. If +free space on the data device drops below this level then a dm event +will be triggered which a userspace daemon should catch allowing it to +extend the pool device. Only one such event will be sent. 
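The sizing guideline above can be sketched as a small shell helper. This is an illustration only, not part of the kernel documentation or of any tool: the function name `thin_metadata_bytes` is hypothetical, both arguments are assumed to be in 512-byte sectors (only their ratio matters), "2MB" and "16GB" are assumed to mean binary units, and clamping to 16GB simply reflects the note that space beyond that size is not used.

```shell
#!/bin/sh
# Hypothetical helper: estimate thin-pool metadata size in bytes using the
# guideline 48 * data_dev_size / data_block_size.
# $1 = data device size, $2 = data block size, both in 512-byte sectors.
thin_metadata_bytes() (
    data_dev_sectors=$1
    data_block_sectors=$2
    min_bytes=$((2 * 1024 * 1024))            # round small results up to 2MB
    max_bytes=$((16 * 1024 * 1024 * 1024))    # space beyond 16GB is not used

    bytes=$((48 * data_dev_sectors / data_block_sectors))
    if [ "$bytes" -lt "$min_bytes" ]; then bytes=$min_bytes; fi
    if [ "$bytes" -gt "$max_bytes" ]; then bytes=$max_bytes; fi
    echo "$bytes"
)

# A 20971520-sector data device with 128-sector (64KB) blocks:
thin_metadata_bytes 20971520 128    # prints 7864320
```

Treat the result as a starting point; pools that hold many snapshots recording large amounts of change may need more.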
+ +No special event is triggered if a just resumed device's free space is below +the low water mark. However, resuming a device always triggers an +event; a userspace daemon should verify that free space exceeds the low +water mark when handling this event. + +A low water mark for the metadata device is maintained in the kernel and +will trigger a dm event if free space on the metadata device drops below +it. + +Updating on-disk metadata +------------------------- + +On-disk metadata is committed every time a FLUSH or FUA bio is written. +If no such requests are made then commits will occur every second. This +means the thin-provisioning target behaves like a physical disk that has +a volatile write cache. If power is lost you may lose some recent +writes. The metadata should always be consistent in spite of any crash. + +If data space is exhausted the pool will either error or queue IO +according to the configuration (see: error_if_no_space). If metadata +space is exhausted or a metadata operation fails: the pool will error IO +until the pool is taken offline and repair is performed to 1) fix any +potential inconsistencies and 2) clear the flag that imposes repair. +Once the pool's metadata device is repaired it may be resized, which +will allow the pool to return to normal operation. Note that if a pool +is flagged as needing repair, the pool's data and metadata devices +cannot be resized until repair is performed. It should also be noted +that when the pool's metadata space is exhausted the current metadata +transaction is aborted. Given that the pool will cache IO whose +completion may have already been acknowledged to upper IO layers +(e.g. filesystem) it is strongly suggested that consistency checks +(e.g. fsck) be performed on those layers when repair of the pool is +required. + +Thin provisioning +----------------- + +i) Creating a new thinly-provisioned volume. 
+ + To create a new thinly-provisioned volume you must send a message to an + active pool device, /dev/mapper/pool in this example:: + + dmsetup message /dev/mapper/pool 0 "create_thin 0" + + Here '0' is an identifier for the volume, a 24-bit number. It's up + to the caller to allocate and manage these identifiers. If the + identifier is already in use, the message will fail with -EEXIST. + +ii) Using a thinly-provisioned volume. + + Thinly-provisioned volumes are activated using the 'thin' target:: + + dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0" + + The last parameter is the identifier for the thinp device. + +Internal snapshots +------------------ + +i) Creating an internal snapshot. + + Snapshots are created with another message to the pool. + + N.B. If the origin device that you wish to snapshot is active, you + must suspend it before creating the snapshot to avoid corruption. + This is NOT enforced at the moment, so please be careful! + + :: + + dmsetup suspend /dev/mapper/thin + dmsetup message /dev/mapper/pool 0 "create_snap 1 0" + dmsetup resume /dev/mapper/thin + + Here '1' is the identifier for the volume, a 24-bit number. '0' is the + identifier for the origin device. + +ii) Using an internal snapshot. + + Once created, the user doesn't have to worry about any connection + between the origin and the snapshot. Indeed the snapshot is no + different from any other thinly-provisioned device and can be + snapshotted itself via the same method. It's perfectly legal to + have only one of them active, and there's no ordering requirement on + activating or removing them both. (This differs from conventional + device-mapper snapshots.) + + Activate it exactly the same way as any other thinly-provisioned volume:: + + dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1" + +External snapshots +------------------ + +You can use an external **read only** device as an origin for a +thinly-provisioned volume.
Any read to an unprovisioned area of the +thin device will be passed through to the origin. Writes trigger +the allocation of new blocks as usual. + +One use case for this is VM hosts that want to run guests on +thinly-provisioned volumes but have the base image on another device +(possibly shared between many VMs). + +You must not write to the origin device if you use this technique! +Of course, you may write to the thin device and take internal snapshots +of the thin volume. + +i) Creating a snapshot of an external device + + This is the same as creating a thin device. + You don't mention the origin at this stage. + + :: + + dmsetup message /dev/mapper/pool 0 "create_thin 0" + +ii) Using a snapshot of an external device. + + Append an extra parameter to the thin target specifying the origin:: + + dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image" + + N.B. All descendants (internal snapshots) of this snapshot require the + same extra origin parameter. + +Deactivation +------------ + +All devices using a pool must be deactivated before the pool itself +can be. + +:: + + dmsetup remove thin + dmsetup remove snap + dmsetup remove pool + +Reference +========= + +'thin-pool' target +------------------ + +i) Constructor + + :: + + thin-pool \ + [ []*] + + Optional feature arguments: + + skip_block_zeroing: + Skip the zeroing of newly-provisioned blocks. + + ignore_discard: + Disable discard support. + + no_discard_passdown: + Don't pass discards down to the underlying + data device, but just remove the mapping. + + read_only: + Don't allow any changes to be made to the pool + metadata. This mode is only available after the + thin-pool has been created and first used in full + read/write mode. It cannot be specified on initial + thin-pool creation. + + error_if_no_space: + Error IOs, instead of queueing, if no space. + + Data block size must be between 64KB (128 sectors) and 1GB + (2097152 sectors) inclusive. 
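The block-size constraint can be checked before a pool table is loaded. The following is a minimal sketch, not something the kernel or dmsetup provides: the function name `valid_data_block_size` is hypothetical, and it combines the inclusive range given here with the multiple-of-128-sectors rule stated in the cookbook section.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a thin-pool data block size, given in
# 512-byte sectors. Succeeds (exit status 0) only if the value is a plain
# decimal number between 128 (64KB) and 2097152 (1GB) inclusive and a
# multiple of 128.
valid_data_block_size() (
    sectors=$1
    case "$sectors" in
        ''|0*|*[!0-9]*) exit 1 ;;   # empty, leading zero, or non-digit
    esac
    [ "$sectors" -ge 128 ] &&
        [ "$sectors" -le 2097152 ] &&
        [ $((sectors % 128)) -eq 0 ]
)

if valid_data_block_size 1024; then
    echo "1024 sectors (512KB) is an acceptable data block size"
fi
```

The kernel performs its own validation when the table is loaded; a check like this only helps a script fail early with a clearer message.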
+ + +ii) Status + + :: + + / + / + ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space + needs_check|- metadata_low_watermark + + transaction id: + A 64-bit number used by userspace to help synchronise with metadata + from volume managers. + + used data blocks / total data blocks + If the number of free blocks drops below the pool's low water mark a + dm event will be sent to userspace. This event is edge-triggered and + it will occur only once after each resume so volume manager writers + should register for the event and then check the target's status. + + held metadata root: + The location, in blocks, of the metadata root that has been + 'held' for userspace read access. '-' indicates there is no + held root. + + discard_passdown|no_discard_passdown + Whether or not discards are actually being passed down to the + underlying device. When this is enabled when loading the table, + it can get disabled if the underlying device doesn't support it. + + ro|rw|out_of_data_space + If the pool encounters certain types of device failures it will + drop into a read-only metadata mode in which no changes to + the pool metadata (like allocating new blocks) are permitted. + + In serious cases where even a read-only mode is deemed unsafe + no further I/O will be permitted and the status will just + contain the string 'Fail'. The userspace recovery tools + should then be used. + + error_if_no_space|queue_if_no_space + If the pool runs out of data or metadata space, the pool will + either queue or error the IO destined to the data device. The + default is to queue the IO until more space is added or the + 'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool + module parameter can be used to change this timeout -- it + defaults to 60 seconds but may be disabled using a value of 0. + + needs_check + A metadata operation has failed, resulting in the needs_check + flag being set in the metadata's superblock. 
The metadata + device must be deactivated and checked/repaired before the + thin-pool can be made fully operational again. '-' indicates + needs_check is not set. + + metadata_low_watermark: + Value of metadata low watermark in blocks. The kernel sets this + value internally but userspace needs to know this value to + determine if an event was caused by crossing this threshold. + +iii) Messages + + create_thin + Create a new thinly-provisioned device. + is an arbitrary unique 24-bit identifier chosen by + the caller. + + create_snap + Create a new snapshot of another thinly-provisioned device. + is an arbitrary unique 24-bit identifier chosen by + the caller. + is the identifier of the thinly-provisioned device + of which the new device will be a snapshot. + + delete + Deletes a thin device. Irreversible. + + set_transaction_id + Userland volume managers, such as LVM, need a way to + synchronise their external metadata with the internal metadata of the + pool target. The thin-pool target offers to store an + arbitrary 64-bit transaction id and return it on the target's + status line. To avoid races you must provide what you think + the current transaction id is when you change it with this + compare-and-swap message. + + reserve_metadata_snap + Reserve a copy of the data mapping btree for use by userland. + This allows userland to inspect the mappings as they were when + this message was executed. Use the pool's status command to + get the root block associated with the metadata snapshot. + + release_metadata_snap + Release a previously reserved copy of the data mapping btree. + +'thin' target +------------- + +i) Constructor + + :: + + thin [] + + pool dev: + the thin-pool device, e.g. /dev/mapper/my_pool or 253:0 + + dev id: + the internal device identifier of the device to be + activated. 
+ + external origin dev: + an optional block device outside the pool to be treated as a + read-only snapshot origin: reads to unprovisioned areas of the + thin target will be mapped to this device. + +The pool doesn't store any size against the thin devices. If you +load a thin target that is smaller than you've been using previously, +then you'll have no access to blocks mapped beyond the end. If you +load a target that is bigger than before, then extra blocks will be +provisioned as and when needed. + +ii) Status + + + If the pool has encountered device errors and failed, the status + will just contain the string 'Fail'. The userspace recovery + tools should then be used. + + In the case where is 0, there is no highest + mapped sector and the value of is unspecified. diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt deleted file mode 100644 index 883e7ca5f745..000000000000 --- a/Documentation/device-mapper/thin-provisioning.txt +++ /dev/null @@ -1,411 +0,0 @@ -Introduction -============ - -This document describes a collection of device-mapper targets that -between them implement thin-provisioning and snapshots. - -The main highlight of this implementation, compared to the previous -implementation of snapshots, is that it allows many virtual devices to -be stored on the same data volume. This simplifies administration and -allows the sharing of data between volumes, thus reducing disk usage. - -Another significant feature is support for an arbitrary depth of -recursive snapshots (snapshots of snapshots of snapshots ...). The -previous implementation of snapshots did this by chaining together -lookup tables, and so performance was O(depth). This new -implementation uses a single data structure to avoid this degradation -with depth. Fragmentation may still be an issue, however, in some -scenarios. 
- -Metadata is stored on a separate device from data, giving the -administrator some freedom, for example to: - -- Improve metadata resilience by storing metadata on a mirrored volume - but data on a non-mirrored one. - -- Improve performance by storing the metadata on SSD. - -Status -====== - -These targets are considered safe for production use. But different use -cases will have different performance characteristics, for example due -to fragmentation of the data volume. - -If you find this software is not performing as expected please mail -dm-devel@redhat.com with details and we'll try our best to improve -things for you. - -Userspace tools for checking and repairing the metadata have been fully -developed and are available as 'thin_check' and 'thin_repair'. The name -of the package that provides these utilities varies by distribution (on -a Red Hat distribution it is named 'device-mapper-persistent-data'). - -Cookbook -======== - -This section describes some quick recipes for using thin provisioning. -They use the dmsetup program to control the device-mapper driver -directly. End users will be advised to use a higher-level volume -manager such as LVM2 once support has been added. - -Pool device ------------ - -The pool device ties together the metadata volume and the data volume. -It maps I/O linearly to the data volume and updates the metadata via -two mechanisms: - -- Function calls from the thin targets - -- Device-mapper 'messages' from userspace which control the creation of new - virtual devices amongst other things. - -Setting up a fresh pool device ------------------------------- - -Setting up a pool device requires a valid metadata device, and a -data device. If you do not have an existing metadata device you can -make one by zeroing the first 4k to indicate empty metadata. - - dd if=/dev/zero of=$metadata_dev bs=4096 count=1 - -The amount of metadata you need will vary according to how many blocks -are shared between thin devices (i.e. 
through snapshots). If you have -less sharing than average you'll need a larger-than-average metadata device. - -As a guide, we suggest you calculate the number of bytes to use in the -metadata device as 48 * $data_dev_size / $data_block_size but round it up -to 2MB if the answer is smaller. If you're creating large numbers of -snapshots which are recording large amounts of change, you may find you -need to increase this. - -The largest size supported is 16GB: If the device is larger, -a warning will be issued and the excess space will not be used. - -Reloading a pool table ----------------------- - -You may reload a pool's table, indeed this is how the pool is resized -if it runs out of space. (N.B. While specifying a different metadata -device when reloading is not forbidden at the moment, things will go -wrong if it does not route I/O to exactly the same on-disk location as -previously.) - -Using an existing pool device ------------------------------ - - dmsetup create pool \ - --table "0 20971520 thin-pool $metadata_dev $data_dev \ - $data_block_size $low_water_mark" - -$data_block_size gives the smallest unit of disk space that can be -allocated at a time expressed in units of 512-byte sectors. -$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a -multiple of 128 (64KB). $data_block_size cannot be changed after the -thin-pool is created. People primarily interested in thin provisioning -may want to use a value such as 1024 (512KB). People doing lots of -snapshotting may want a smaller value such as 128 (64KB). If you are -not zeroing newly-allocated data, a larger $data_block_size in the -region of 256000 (128MB) is suggested. - -$low_water_mark is expressed in blocks of size $data_block_size. If -free space on the data device drops below this level then a dm event -will be triggered which a userspace daemon should catch allowing it to -extend the pool device. Only one such event will be sent. 
- -No special event is triggered if a just resumed device's free space is below -the low water mark. However, resuming a device always triggers an -event; a userspace daemon should verify that free space exceeds the low -water mark when handling this event. - -A low water mark for the metadata device is maintained in the kernel and -will trigger a dm event if free space on the metadata device drops below -it. - -Updating on-disk metadata -------------------------- - -On-disk metadata is committed every time a FLUSH or FUA bio is written. -If no such requests are made then commits will occur every second. This -means the thin-provisioning target behaves like a physical disk that has -a volatile write cache. If power is lost you may lose some recent -writes. The metadata should always be consistent in spite of any crash. - -If data space is exhausted the pool will either error or queue IO -according to the configuration (see: error_if_no_space). If metadata -space is exhausted or a metadata operation fails: the pool will error IO -until the pool is taken offline and repair is performed to 1) fix any -potential inconsistencies and 2) clear the flag that imposes repair. -Once the pool's metadata device is repaired it may be resized, which -will allow the pool to return to normal operation. Note that if a pool -is flagged as needing repair, the pool's data and metadata devices -cannot be resized until repair is performed. It should also be noted -that when the pool's metadata space is exhausted the current metadata -transaction is aborted. Given that the pool will cache IO whose -completion may have already been acknowledged to upper IO layers -(e.g. filesystem) it is strongly suggested that consistency checks -(e.g. fsck) be performed on those layers when repair of the pool is -required. - -Thin provisioning ------------------ - -i) Creating a new thinly-provisioned volume. 
- - To create a new thinly- provisioned volume you must send a message to an - active pool device, /dev/mapper/pool in this example. - - dmsetup message /dev/mapper/pool 0 "create_thin 0" - - Here '0' is an identifier for the volume, a 24-bit number. It's up - to the caller to allocate and manage these identifiers. If the - identifier is already in use, the message will fail with -EEXIST. - -ii) Using a thinly-provisioned volume. - - Thinly-provisioned volumes are activated using the 'thin' target: - - dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0" - - The last parameter is the identifier for the thinp device. - -Internal snapshots ------------------- - -i) Creating an internal snapshot. - - Snapshots are created with another message to the pool. - - N.B. If the origin device that you wish to snapshot is active, you - must suspend it before creating the snapshot to avoid corruption. - This is NOT enforced at the moment, so please be careful! - - dmsetup suspend /dev/mapper/thin - dmsetup message /dev/mapper/pool 0 "create_snap 1 0" - dmsetup resume /dev/mapper/thin - - Here '1' is the identifier for the volume, a 24-bit number. '0' is the - identifier for the origin device. - -ii) Using an internal snapshot. - - Once created, the user doesn't have to worry about any connection - between the origin and the snapshot. Indeed the snapshot is no - different from any other thinly-provisioned device and can be - snapshotted itself via the same method. It's perfectly legal to - have only one of them active, and there's no ordering requirement on - activating or removing them both. (This differs from conventional - device-mapper snapshots.) - - Activate it exactly the same way as any other thinly-provisioned volume: - - dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1" - -External snapshots ------------------- - -You can use an external _read only_ device as an origin for a -thinly-provisioned volume. 
Any read to an unprovisioned area of the -thin device will be passed through to the origin. Writes trigger -the allocation of new blocks as usual. - -One use case for this is VM hosts that want to run guests on -thinly-provisioned volumes but have the base image on another device -(possibly shared between many VMs). - -You must not write to the origin device if you use this technique! -Of course, you may write to the thin device and take internal snapshots -of the thin volume. - -i) Creating a snapshot of an external device - - This is the same as creating a thin device. - You don't mention the origin at this stage. - - dmsetup message /dev/mapper/pool 0 "create_thin 0" - -ii) Using a snapshot of an external device. - - Append an extra parameter to the thin target specifying the origin: - - dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image" - - N.B. All descendants (internal snapshots) of this snapshot require the - same extra origin parameter. - -Deactivation ------------- - -All devices using a pool must be deactivated before the pool itself -can be. - - dmsetup remove thin - dmsetup remove snap - dmsetup remove pool - -Reference -========= - -'thin-pool' target ------------------- - -i) Constructor - - thin-pool \ - [ []*] - - Optional feature arguments: - - skip_block_zeroing: Skip the zeroing of newly-provisioned blocks. - - ignore_discard: Disable discard support. - - no_discard_passdown: Don't pass discards down to the underlying - data device, but just remove the mapping. - - read_only: Don't allow any changes to be made to the pool - metadata. This mode is only available after the - thin-pool has been created and first used in full - read/write mode. It cannot be specified on initial - thin-pool creation. - - error_if_no_space: Error IOs, instead of queueing, if no space. - - Data block size must be between 64KB (128 sectors) and 1GB - (2097152 sectors) inclusive. 
- - -ii) Status - - / - / - ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space - needs_check|- metadata_low_watermark - - transaction id: - A 64-bit number used by userspace to help synchronise with metadata - from volume managers. - - used data blocks / total data blocks - If the number of free blocks drops below the pool's low water mark a - dm event will be sent to userspace. This event is edge-triggered and - it will occur only once after each resume so volume manager writers - should register for the event and then check the target's status. - - held metadata root: - The location, in blocks, of the metadata root that has been - 'held' for userspace read access. '-' indicates there is no - held root. - - discard_passdown|no_discard_passdown - Whether or not discards are actually being passed down to the - underlying device. When this is enabled when loading the table, - it can get disabled if the underlying device doesn't support it. - - ro|rw|out_of_data_space - If the pool encounters certain types of device failures it will - drop into a read-only metadata mode in which no changes to - the pool metadata (like allocating new blocks) are permitted. - - In serious cases where even a read-only mode is deemed unsafe - no further I/O will be permitted and the status will just - contain the string 'Fail'. The userspace recovery tools - should then be used. - - error_if_no_space|queue_if_no_space - If the pool runs out of data or metadata space, the pool will - either queue or error the IO destined to the data device. The - default is to queue the IO until more space is added or the - 'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool - module parameter can be used to change this timeout -- it - defaults to 60 seconds but may be disabled using a value of 0. - - needs_check - A metadata operation has failed, resulting in the needs_check - flag being set in the metadata's superblock. 
The metadata - device must be deactivated and checked/repaired before the - thin-pool can be made fully operational again. '-' indicates - needs_check is not set. - - metadata_low_watermark: - Value of metadata low watermark in blocks. The kernel sets this - value internally but userspace needs to know this value to - determine if an event was caused by crossing this threshold. - -iii) Messages - - create_thin - - Create a new thinly-provisioned device. - is an arbitrary unique 24-bit identifier chosen by - the caller. - - create_snap - - Create a new snapshot of another thinly-provisioned device. - is an arbitrary unique 24-bit identifier chosen by - the caller. - is the identifier of the thinly-provisioned device - of which the new device will be a snapshot. - - delete - - Deletes a thin device. Irreversible. - - set_transaction_id - - Userland volume managers, such as LVM, need a way to - synchronise their external metadata with the internal metadata of the - pool target. The thin-pool target offers to store an - arbitrary 64-bit transaction id and return it on the target's - status line. To avoid races you must provide what you think - the current transaction id is when you change it with this - compare-and-swap message. - - reserve_metadata_snap - - Reserve a copy of the data mapping btree for use by userland. - This allows userland to inspect the mappings as they were when - this message was executed. Use the pool's status command to - get the root block associated with the metadata snapshot. - - release_metadata_snap - - Release a previously reserved copy of the data mapping btree. - -'thin' target -------------- - -i) Constructor - - thin [] - - pool dev: - the thin-pool device, e.g. /dev/mapper/my_pool or 253:0 - - dev id: - the internal device identifier of the device to be - activated. 
- - external origin dev: - an optional block device outside the pool to be treated as a - read-only snapshot origin: reads to unprovisioned areas of the - thin target will be mapped to this device. - -The pool doesn't store any size against the thin devices. If you -load a thin target that is smaller than you've been using previously, -then you'll have no access to blocks mapped beyond the end. If you -load a target that is bigger than before, then extra blocks will be -provisioned as and when needed. - -ii) Status - - - - If the pool has encountered device errors and failed, the status - will just contain the string 'Fail'. The userspace recovery - tools should then be used. - - In the case where is 0, there is no highest - mapped sector and the value of is unspecified. diff --git a/Documentation/device-mapper/unstriped.rst b/Documentation/device-mapper/unstriped.rst new file mode 100644 index 000000000000..0a8d3eb3f072 --- /dev/null +++ b/Documentation/device-mapper/unstriped.rst @@ -0,0 +1,135 @@ +================================ +Device-mapper "unstriped" target +================================ + +Introduction +============ + +The device-mapper "unstriped" target provides a transparent mechanism to +unstripe a device-mapper "striped" target to access the underlying disks +without having to touch the true backing block-device. It can also be +used to unstripe a hardware RAID-0 to access backing disks. + +Parameters: + + + + The number of stripes in the RAID 0. + + + The amount of 512B sectors in the chunk striping. + + + The block device you wish to unstripe. + + + The stripe number within the device that corresponds to physical + drive you wish to unstripe. This must be 0 indexed. + + +Why use this module? +==================== + +An example of undoing an existing dm-stripe +------------------------------------------- + +This small bash script will setup 4 loop devices and use the existing +striped target to combine the 4 devices into one. 
It then will use +the unstriped target ontop of the striped device to access the +individual backing loop devices. We write data to the newly exposed +unstriped devices and verify the data written matches the correct +underlying device on the striped array:: + + #!/bin/bash + + MEMBER_SIZE=$((128 * 1024 * 1024)) + NUM=4 + SEQ_END=$((${NUM}-1)) + CHUNK=256 + BS=4096 + + RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512)) + DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}" + COUNT=$((${MEMBER_SIZE} / ${BS})) + + for i in $(seq 0 ${SEQ_END}); do + dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct + losetup /dev/loop${i} member-${i} + DM_PARMS+=" /dev/loop${i} 0" + done + + echo $DM_PARMS | dmsetup create raid0 + for i in $(seq 0 ${SEQ_END}); do + echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i} + done; + + for i in $(seq 0 ${SEQ_END}); do + dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct + diff /dev/mapper/set-${i} member-${i} + done; + + for i in $(seq 0 ${SEQ_END}); do + dmsetup remove set-${i} + done + + dmsetup remove raid0 + + for i in $(seq 0 ${SEQ_END}); do + losetup -d /dev/loop${i} + rm -f member-${i} + done + +Another example +--------------- + +Intel NVMe drives contain two cores on the physical device. +Each core of the drive has segregated access to its LBA range. +The current LBA model has a RAID 0 128k chunk on each core, resulting +in a 256k stripe across the two cores:: + + Core 0: Core 1: + __________ __________ + | LBA 512| | LBA 768| + | LBA 0 | | LBA 256| + ---------- ---------- + +The purpose of this unstriping is to provide better QoS in noisy +neighbor environments. When two partitions are created on the +aggregate drive without this unstriping, reads on one partition +can affect writes on another partition. This is because the partitions +are striped across the two cores. 
When we unstripe this hardware RAID 0 +and make partitions on each new exposed device the two partitions are now +physically separated. + +With the dm-unstriped target we're able to segregate an fio script that +has read and write jobs that are independent of each other. Compared to +when we run the test on a combined drive with partitions, we were able +to get a 92% reduction in read latency using this device mapper target. + + +Example dmsetup usage +===================== + +unstriped ontop of Intel NVMe device that has 2 cores +----------------------------------------------------- + +:: + + dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0' + dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0' + +There will now be two devices that expose Intel NVMe core 0 and 1 +respectively:: + + /dev/mapper/nvmset0 + /dev/mapper/nvmset1 + +unstriped ontop of striped with 4 drives using 128K chunk size +-------------------------------------------------------------- + +:: + + dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0' + dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0' + dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0' + dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0' diff --git a/Documentation/device-mapper/unstriped.txt b/Documentation/device-mapper/unstriped.txt deleted file mode 100644 index 0b2a306c54ee..000000000000 --- a/Documentation/device-mapper/unstriped.txt +++ /dev/null @@ -1,124 +0,0 @@ -Introduction -============ - -The device-mapper "unstriped" target provides a transparent mechanism to -unstripe a device-mapper "striped" target to access the underlying disks -without having to touch the true backing block-device. It can also be -used to unstripe a hardware RAID-0 to access backing disks. - -Parameters: - - - - The number of stripes in the RAID 0. 
- - - The amount of 512B sectors in the chunk striping. - - - The block device you wish to unstripe. - - - The stripe number within the device that corresponds to physical - drive you wish to unstripe. This must be 0 indexed. - - -Why use this module? -==================== - -An example of undoing an existing dm-stripe -------------------------------------------- - -This small bash script will setup 4 loop devices and use the existing -striped target to combine the 4 devices into one. It then will use -the unstriped target ontop of the striped device to access the -individual backing loop devices. We write data to the newly exposed -unstriped devices and verify the data written matches the correct -underlying device on the striped array. - -#!/bin/bash - -MEMBER_SIZE=$((128 * 1024 * 1024)) -NUM=4 -SEQ_END=$((${NUM}-1)) -CHUNK=256 -BS=4096 - -RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512)) -DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}" -COUNT=$((${MEMBER_SIZE} / ${BS})) - -for i in $(seq 0 ${SEQ_END}); do - dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct - losetup /dev/loop${i} member-${i} - DM_PARMS+=" /dev/loop${i} 0" -done - -echo $DM_PARMS | dmsetup create raid0 -for i in $(seq 0 ${SEQ_END}); do - echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i} -done; - -for i in $(seq 0 ${SEQ_END}); do - dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct - diff /dev/mapper/set-${i} member-${i} -done; - -for i in $(seq 0 ${SEQ_END}); do - dmsetup remove set-${i} -done - -dmsetup remove raid0 - -for i in $(seq 0 ${SEQ_END}); do - losetup -d /dev/loop${i} - rm -f member-${i} -done - -Another example ---------------- - -Intel NVMe drives contain two cores on the physical device. -Each core of the drive has segregated access to its LBA range. 
-The current LBA model has a RAID 0 128k chunk on each core, resulting -in a 256k stripe across the two cores: - - Core 0: Core 1: - __________ __________ - | LBA 512| | LBA 768| - | LBA 0 | | LBA 256| - ---------- ---------- - -The purpose of this unstriping is to provide better QoS in noisy -neighbor environments. When two partitions are created on the -aggregate drive without this unstriping, reads on one partition -can affect writes on another partition. This is because the partitions -are striped across the two cores. When we unstripe this hardware RAID 0 -and make partitions on each new exposed device the two partitions are now -physically separated. - -With the dm-unstriped target we're able to segregate an fio script that -has read and write jobs that are independent of each other. Compared to -when we run the test on a combined drive with partitions, we were able -to get a 92% reduction in read latency using this device mapper target. - - -Example dmsetup usage -===================== - -unstriped ontop of Intel NVMe device that has 2 cores ------------------------------------------------------ -dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0' -dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0' - -There will now be two devices that expose Intel NVMe core 0 and 1 -respectively: -/dev/mapper/nvmset0 -/dev/mapper/nvmset1 - -unstriped ontop of striped with 4 drives using 128K chunk size --------------------------------------------------------------- -dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0' -dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0' -dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0' -dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0' diff --git a/Documentation/device-mapper/verity.rst b/Documentation/device-mapper/verity.rst new file mode 100644 index 
000000000000..a4d1c1476d72 --- /dev/null +++ b/Documentation/device-mapper/verity.rst @@ -0,0 +1,229 @@ +========= +dm-verity +========= + +Device-Mapper's "verity" target provides transparent integrity checking of +block devices using a cryptographic digest provided by the kernel crypto API. +This target is read-only. + +Construction Parameters +======================= + +:: + + + + + + [<#opt_params> ] + + + This is the type of the on-disk hash format. + + 0 is the original format used in the Chromium OS. + The salt is appended when hashing, digests are stored continuously and + the rest of the block is padded with zeroes. + + 1 is the current format that should be used for new devices. + The salt is prepended when hashing and each digest is + padded with zeroes to the power of two. + + + This is the device containing data, the integrity of which needs to be + checked. It may be specified as a path, like /dev/sdaX, or a device number, + :. + + + This is the device that supplies the hash tree data. It may be + specified similarly to the device path and may be the same device. If the + same device is used, the hash_start should be outside the configured + dm-verity device. + + + The block size on a data device in bytes. + Each block corresponds to one digest on the hash device. + + + The size of a hash block in bytes. + + + The number of data blocks on the data device. Additional blocks are + inaccessible. You can place hashes to the same partition as data, in this + case hashes are placed after . + + + This is the offset, in -blocks, from the start of hash_dev + to the root block of the hash tree. + + + The cryptographic hash algorithm used for this device. This should + be the name of the algorithm, like "sha1". + + + The hexadecimal encoding of the cryptographic hash of the root hash block + and the salt. This hash should be trusted as there is no other authenticity + beyond this point. + + + The hexadecimal encoding of the salt value. 
+ +<#opt_params> + Number of optional parameters. If there are no optional parameters, + the optional paramaters section can be skipped or #opt_params can be zero. + Otherwise #opt_params is the number of following arguments. + + Example of optional parameters section: + 1 ignore_corruption + +ignore_corruption + Log corrupted blocks, but allow read operations to proceed normally. + +restart_on_corruption + Restart the system when a corrupted block is discovered. This option is + not compatible with ignore_corruption and requires user space support to + avoid restart loops. + +ignore_zero_blocks + Do not verify blocks that are expected to contain zeroes and always return + zeroes instead. This may be useful if the partition contains unused blocks + that are not guaranteed to contain zeroes. + +use_fec_from_device + Use forward error correction (FEC) to recover from corruption if hash + verification fails. Use encoding data from the specified device. This + may be the same device where data and hash blocks reside, in which case + fec_start must be outside data and hash areas. + + If the encoding data covers additional metadata, it must be accessible + on the hash device after the hash blocks. + + Note: block sizes for data and hash devices must match. Also, if the + verity is encrypted the should be too. + +fec_roots + Number of generator roots. This equals to the number of parity bytes in + the encoding data. For example, in RS(M, N) encoding, the number of roots + is M-N. + +fec_blocks + The number of encoding data blocks on the FEC device. The block size for + the FEC device is . + +fec_start + This is the offset, in blocks, from the start of the + FEC device to the beginning of the encoding data. + +check_at_most_once + Verify data blocks only the first time they are read from the data device, + rather than every time. This reduces the overhead of dm-verity so that it + can be used on systems that are memory and/or CPU constrained. 
However, it + provides a reduced level of security because only offline tampering of the + data device's content will be detected, not online tampering. + + Hash blocks are still verified each time they are read from the hash device, + since verification of hash blocks is less performance critical than data + blocks, and a hash block will not be verified any more after all the data + blocks it covers have been verified anyway. + +Theory of operation +=================== + +dm-verity is meant to be set up as part of a verified boot path. This +may be anything ranging from a boot using tboot or trustedgrub to just +booting from a known-good device (like a USB drive or CD). + +When a dm-verity device is configured, it is expected that the caller +has been authenticated in some way (cryptographic signatures, etc). +After instantiation, all hashes will be verified on-demand during +disk access. If they cannot be verified up to the root node of the +tree, the root hash, then the I/O will fail. This should detect +tampering with any data on the device and the hash data. + +Cryptographic hashes are used to assert the integrity of the device on a +per-block basis. This allows for a lightweight hash computation on first read +into the page cache. Block hashes are stored linearly, aligned to the nearest +block size. + +If forward error correction (FEC) support is enabled any recovery of +corrupted data will be verified using the cryptographic hash of the +corresponding data. This is why combining error correction with +integrity checking is essential. + +Hash Tree +--------- + +Each node in the tree is a cryptographic hash. If it is a leaf node, the hash +of some data block on disk is calculated. If it is an intermediary node, +the hash of a number of child nodes is calculated. + +Each entry in the tree is a collection of neighboring nodes that fit in one +block. The number is determined based on block_size and the size of the +selected cryptographic digest algorithm. 
The hashes are linearly-ordered in +this entry and any unaligned trailing space is ignored but included when +calculating the parent node. + +The tree looks something like: + + alg = sha256, num_blocks = 32768, block_size = 4096 + +:: + + [ root ] + / . . . \ + [entry_0] [entry_1] + / . . . \ . . . \ + [entry_0_0] . . . [entry_0_127] . . . . [entry_1_127] + / ... \ / . . . \ / \ + blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . blk_32767 + + +On-disk format +============== + +The verity kernel code does not read the verity metadata on-disk header. +It only reads the hash blocks which directly follow the header. +It is expected that a user-space tool will verify the integrity of the +verity header. + +Alternatively, the header can be omitted and the dmsetup parameters can +be passed via the kernel command-line in a rooted chain of trust where +the command-line is verified. + +Directly following the header (and with sector number padded to the next hash +block boundary) are the hash blocks which are stored a depth at a time +(starting from the root), sorted in order of increasing index. + +The full specification of kernel parameters and on-disk metadata format +is available at the cryptsetup project's wiki page + + https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity + +Status +====== +V (for Valid) is returned if every check performed so far was valid. +If any check failed, C (for Corruption) is returned. + +Example +======= +Set up a device:: + + # dmsetup create vroot --readonly --table \ + "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\ + "4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\ + "1234000000000000000000000000000000000000000000000000000000000000" + +A command line tool veritysetup is available to compute or verify +the hash tree or activate the kernel device. This is available from +the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/ +(as a libcryptsetup extension). 
+ +Create hash on the device:: + + # veritysetup format /dev/sda1 /dev/sda2 + ... + Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 + +Activate the device:: + + # veritysetup create vroot /dev/sda1 /dev/sda2 \ + 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 diff --git a/Documentation/device-mapper/verity.txt b/Documentation/device-mapper/verity.txt deleted file mode 100644 index b3d2e4a42255..000000000000 --- a/Documentation/device-mapper/verity.txt +++ /dev/null @@ -1,219 +0,0 @@ -dm-verity -========== - -Device-Mapper's "verity" target provides transparent integrity checking of -block devices using a cryptographic digest provided by the kernel crypto API. -This target is read-only. - -Construction Parameters -======================= - - - - - [<#opt_params> ] - - - This is the type of the on-disk hash format. - - 0 is the original format used in the Chromium OS. - The salt is appended when hashing, digests are stored continuously and - the rest of the block is padded with zeroes. - - 1 is the current format that should be used for new devices. - The salt is prepended when hashing and each digest is - padded with zeroes to the power of two. - - - This is the device containing data, the integrity of which needs to be - checked. It may be specified as a path, like /dev/sdaX, or a device number, - :. - - - This is the device that supplies the hash tree data. It may be - specified similarly to the device path and may be the same device. If the - same device is used, the hash_start should be outside the configured - dm-verity device. - - - The block size on a data device in bytes. - Each block corresponds to one digest on the hash device. - - - The size of a hash block in bytes. - - - The number of data blocks on the data device. Additional blocks are - inaccessible. You can place hashes to the same partition as data, in this - case hashes are placed after . 
- - - This is the offset, in -blocks, from the start of hash_dev - to the root block of the hash tree. - - - The cryptographic hash algorithm used for this device. This should - be the name of the algorithm, like "sha1". - - - The hexadecimal encoding of the cryptographic hash of the root hash block - and the salt. This hash should be trusted as there is no other authenticity - beyond this point. - - - The hexadecimal encoding of the salt value. - -<#opt_params> - Number of optional parameters. If there are no optional parameters, - the optional paramaters section can be skipped or #opt_params can be zero. - Otherwise #opt_params is the number of following arguments. - - Example of optional parameters section: - 1 ignore_corruption - -ignore_corruption - Log corrupted blocks, but allow read operations to proceed normally. - -restart_on_corruption - Restart the system when a corrupted block is discovered. This option is - not compatible with ignore_corruption and requires user space support to - avoid restart loops. - -ignore_zero_blocks - Do not verify blocks that are expected to contain zeroes and always return - zeroes instead. This may be useful if the partition contains unused blocks - that are not guaranteed to contain zeroes. - -use_fec_from_device - Use forward error correction (FEC) to recover from corruption if hash - verification fails. Use encoding data from the specified device. This - may be the same device where data and hash blocks reside, in which case - fec_start must be outside data and hash areas. - - If the encoding data covers additional metadata, it must be accessible - on the hash device after the hash blocks. - - Note: block sizes for data and hash devices must match. Also, if the - verity is encrypted the should be too. - -fec_roots - Number of generator roots. This equals to the number of parity bytes in - the encoding data. For example, in RS(M, N) encoding, the number of roots - is M-N. 
- -fec_blocks - The number of encoding data blocks on the FEC device. The block size for - the FEC device is . - -fec_start - This is the offset, in blocks, from the start of the - FEC device to the beginning of the encoding data. - -check_at_most_once - Verify data blocks only the first time they are read from the data device, - rather than every time. This reduces the overhead of dm-verity so that it - can be used on systems that are memory and/or CPU constrained. However, it - provides a reduced level of security because only offline tampering of the - data device's content will be detected, not online tampering. - - Hash blocks are still verified each time they are read from the hash device, - since verification of hash blocks is less performance critical than data - blocks, and a hash block will not be verified any more after all the data - blocks it covers have been verified anyway. - -Theory of operation -=================== - -dm-verity is meant to be set up as part of a verified boot path. This -may be anything ranging from a boot using tboot or trustedgrub to just -booting from a known-good device (like a USB drive or CD). - -When a dm-verity device is configured, it is expected that the caller -has been authenticated in some way (cryptographic signatures, etc). -After instantiation, all hashes will be verified on-demand during -disk access. If they cannot be verified up to the root node of the -tree, the root hash, then the I/O will fail. This should detect -tampering with any data on the device and the hash data. - -Cryptographic hashes are used to assert the integrity of the device on a -per-block basis. This allows for a lightweight hash computation on first read -into the page cache. Block hashes are stored linearly, aligned to the nearest -block size. - -If forward error correction (FEC) support is enabled any recovery of -corrupted data will be verified using the cryptographic hash of the -corresponding data. 
This is why combining error correction with -integrity checking is essential. - -Hash Tree ---------- - -Each node in the tree is a cryptographic hash. If it is a leaf node, the hash -of some data block on disk is calculated. If it is an intermediary node, -the hash of a number of child nodes is calculated. - -Each entry in the tree is a collection of neighboring nodes that fit in one -block. The number is determined based on block_size and the size of the -selected cryptographic digest algorithm. The hashes are linearly-ordered in -this entry and any unaligned trailing space is ignored but included when -calculating the parent node. - -The tree looks something like: - -alg = sha256, num_blocks = 32768, block_size = 4096 - - [ root ] - / . . . \ - [entry_0] [entry_1] - / . . . \ . . . \ - [entry_0_0] . . . [entry_0_127] . . . . [entry_1_127] - / ... \ / . . . \ / \ - blk_0 ... blk_127 blk_16256 blk_16383 blk_32640 . . . blk_32767 - - -On-disk format -============== - -The verity kernel code does not read the verity metadata on-disk header. -It only reads the hash blocks which directly follow the header. -It is expected that a user-space tool will verify the integrity of the -verity header. - -Alternatively, the header can be omitted and the dmsetup parameters can -be passed via the kernel command-line in a rooted chain of trust where -the command-line is verified. - -Directly following the header (and with sector number padded to the next hash -block boundary) are the hash blocks which are stored a depth at a time -(starting from the root), sorted in order of increasing index. - -The full specification of kernel parameters and on-disk metadata format -is available at the cryptsetup project's wiki page - https://gitlab.com/cryptsetup/cryptsetup/wikis/DMVerity - -Status -====== -V (for Valid) is returned if every check performed so far was valid. -If any check failed, C (for Corruption) is returned. 
- -Example -======= -Set up a device: - # dmsetup create vroot --readonly --table \ - "0 2097152 verity 1 /dev/sda1 /dev/sda2 4096 4096 262144 1 sha256 "\ - "4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 "\ - "1234000000000000000000000000000000000000000000000000000000000000" - -A command line tool veritysetup is available to compute or verify -the hash tree or activate the kernel device. This is available from -the cryptsetup upstream repository https://gitlab.com/cryptsetup/cryptsetup/ -(as a libcryptsetup extension). - -Create hash on the device: - # veritysetup format /dev/sda1 /dev/sda2 - ... - Root hash: 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 - -Activate the device: - # veritysetup create vroot /dev/sda1 /dev/sda2 \ - 4392712ba01368efdf14b05c76f9e4df0d53664630b5d48632ed17a137f39076 diff --git a/Documentation/device-mapper/writecache.rst b/Documentation/device-mapper/writecache.rst new file mode 100644 index 000000000000..d3d7690f5e8d --- /dev/null +++ b/Documentation/device-mapper/writecache.rst @@ -0,0 +1,79 @@ +================= +Writecache target +================= + +The writecache target caches writes on persistent memory or on SSD. It +doesn't cache reads because reads are supposed to be cached in page cache +in normal RAM. + +When the device is constructed, the first sector should be zeroed or the +first sector should contain valid superblock from previous invocation. + +Constructor parameters: + +1. type of the cache device - "p" or "s" + + - p - persistent memory + - s - SSD +2. the underlying device that will be cached +3. the cache device +4. block size (4096 is recommended; the maximum block size is the page + size) +5. 
the number of optional parameters (the parameters with an argument + count as two) + + start_sector n (default: 0) + offset from the start of cache device in 512-byte sectors + high_watermark n (default: 50) + start writeback when the number of used blocks reach this + watermark + low_watermark x (default: 45) + stop writeback when the number of used blocks drops below + this watermark + writeback_jobs n (default: unlimited) + limit the number of blocks that are in flight during + writeback. Setting this value reduces writeback + throughput, but it may improve latency of read requests + autocommit_blocks n (default: 64 for pmem, 65536 for ssd) + when the application writes this amount of blocks without + issuing the FLUSH request, the blocks are automatically + commited + autocommit_time ms (default: 1000) + autocommit time in milliseconds. The data is automatically + commited if this time passes and no FLUSH request is + received + fua (by default on) + applicable only to persistent memory - use the FUA flag + when writing data from persistent memory back to the + underlying device + nofua + applicable only to persistent memory - don't use the FUA + flag when writing back data and send the FLUSH request + afterwards + + - some underlying devices perform better with fua, some + with nofua. The user should test it + +Status: +1. error indicator - 0 if there was no error, otherwise error number +2. the number of blocks +3. the number of free blocks +4. the number of blocks under writeback + +Messages: + flush + flush the cache device. The message returns successfully + if the cache device was flushed without an error + flush_on_suspend + flush the cache device on next suspend. Use this message + when you are going to remove the cache device. The proper + sequence for removing the cache device is: + + 1. send the "flush_on_suspend" message + 2. load an inactive table with a linear target that maps + to the underlying device + 3. suspend the device + 4. 
ask for status and verify that there are no errors + 5. resume the device, so that it will use the linear + target + 6. the cache device is now inactive and it can be deleted diff --git a/Documentation/device-mapper/writecache.txt b/Documentation/device-mapper/writecache.txt deleted file mode 100644 index 01532b3008ae..000000000000 --- a/Documentation/device-mapper/writecache.txt +++ /dev/null @@ -1,70 +0,0 @@ -The writecache target caches writes on persistent memory or on SSD. It -doesn't cache reads because reads are supposed to be cached in page cache -in normal RAM. - -When the device is constructed, the first sector should be zeroed or the -first sector should contain valid superblock from previous invocation. - -Constructor parameters: -1. type of the cache device - "p" or "s" - p - persistent memory - s - SSD -2. the underlying device that will be cached -3. the cache device -4. block size (4096 is recommended; the maximum block size is the page - size) -5. the number of optional parameters (the parameters with an argument - count as two) - start_sector n (default: 0) - offset from the start of cache device in 512-byte sectors - high_watermark n (default: 50) - start writeback when the number of used blocks reach this - watermark - low_watermark x (default: 45) - stop writeback when the number of used blocks drops below - this watermark - writeback_jobs n (default: unlimited) - limit the number of blocks that are in flight during - writeback. Setting this value reduces writeback - throughput, but it may improve latency of read requests - autocommit_blocks n (default: 64 for pmem, 65536 for ssd) - when the application writes this amount of blocks without - issuing the FLUSH request, the blocks are automatically - commited - autocommit_time ms (default: 1000) - autocommit time in milliseconds. 
The data is automatically - commited if this time passes and no FLUSH request is - received - fua (by default on) - applicable only to persistent memory - use the FUA flag - when writing data from persistent memory back to the - underlying device - nofua - applicable only to persistent memory - don't use the FUA - flag when writing back data and send the FLUSH request - afterwards - - some underlying devices perform better with fua, some - with nofua. The user should test it - -Status: -1. error indicator - 0 if there was no error, otherwise error number -2. the number of blocks -3. the number of free blocks -4. the number of blocks under writeback - -Messages: - flush - flush the cache device. The message returns successfully - if the cache device was flushed without an error - flush_on_suspend - flush the cache device on next suspend. Use this message - when you are going to remove the cache device. The proper - sequence for removing the cache device is: - 1. send the "flush_on_suspend" message - 2. load an inactive table with a linear target that maps - to the underlying device - 3. suspend the device - 4. ask for status and verify that there are no errors - 5. resume the device, so that it will use the linear - target - 6. the cache device is now inactive and it can be deleted diff --git a/Documentation/device-mapper/zero.rst b/Documentation/device-mapper/zero.rst new file mode 100644 index 000000000000..11fb5cf4597c --- /dev/null +++ b/Documentation/device-mapper/zero.rst @@ -0,0 +1,37 @@ +======= +dm-zero +======= + +Device-Mapper's "zero" target provides a block-device that always returns +zero'd data on reads and silently drops writes. This is similar behavior to +/dev/zero, but as a block-device instead of a character-device. + +Dm-zero has no target-specific parameters. + +One very interesting use of dm-zero is for creating "sparse" devices in +conjunction with dm-snapshot. 
A sparse device reports a device-size larger +than the amount of actual storage space available for that device. A user can +write data anywhere within the sparse device and read it back like a normal +device. Reads to previously unwritten areas will return a zero'd buffer. When +enough data has been written to fill up the actual storage space, the sparse +device is deactivated. This can be very useful for testing device and +filesystem limitations. + +To create a sparse device, start by creating a dm-zero device that's the +desired size of the sparse device. For this example, we'll assume a 10TB +sparse device:: + + TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors + echo "0 $TEN_TERABYTES zero" | dmsetup create zero1 + +Then create a snapshot of the zero device, using any available block-device as +the COW device. The size of the COW device will determine the amount of real +space available to the sparse device. For this example, we'll assume /dev/sdb1 +is an available 10GB partition:: + + echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \ + dmsetup create sparse1 + +This will create a 10TB sparse device called /dev/mapper/sparse1 that has +10GB of actual storage space available. If more than 10GB of data is written +to this device, it will start returning I/O errors. diff --git a/Documentation/device-mapper/zero.txt b/Documentation/device-mapper/zero.txt deleted file mode 100644 index 20fb38e7fa7e..000000000000 --- a/Documentation/device-mapper/zero.txt +++ /dev/null @@ -1,37 +0,0 @@ -dm-zero -======= - -Device-Mapper's "zero" target provides a block-device that always returns -zero'd data on reads and silently drops writes. This is similar behavior to -/dev/zero, but as a block-device instead of a character-device. - -Dm-zero has no target-specific parameters. - -One very interesting use of dm-zero is for creating "sparse" devices in -conjunction with dm-snapshot. 
A sparse device reports a device-size larger -than the amount of actual storage space available for that device. A user can -write data anywhere within the sparse device and read it back like a normal -device. Reads to previously unwritten areas will return a zero'd buffer. When -enough data has been written to fill up the actual storage space, the sparse -device is deactivated. This can be very useful for testing device and -filesystem limitations. - -To create a sparse device, start by creating a dm-zero device that's the -desired size of the sparse device. For this example, we'll assume a 10TB -sparse device. - -TEN_TERABYTES=`expr 10 \* 1024 \* 1024 \* 1024 \* 2` # 10 TB in sectors -echo "0 $TEN_TERABYTES zero" | dmsetup create zero1 - -Then create a snapshot of the zero device, using any available block-device as -the COW device. The size of the COW device will determine the amount of real -space available to the sparse device. For this example, we'll assume /dev/sdb1 -is an available 10GB partition. - -echo "0 $TEN_TERABYTES snapshot /dev/mapper/zero1 /dev/sdb1 p 128" | \ - dmsetup create sparse1 - -This will create a 10TB sparse device called /dev/mapper/sparse1 that has -10GB of actual storage space available. If more than 10GB of data is written -to this device, it will start returning I/O errors. - diff --git a/Documentation/filesystems/ubifs-authentication.md b/Documentation/filesystems/ubifs-authentication.md index 028b3e2e25f9..23e698167141 100644 --- a/Documentation/filesystems/ubifs-authentication.md +++ b/Documentation/filesystems/ubifs-authentication.md @@ -417,9 +417,9 @@ will then have to be provided beforehand in the normal way. 
[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/ -[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt +[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.rst -[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt +[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.rst [FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 45254b3ef715..5ccac0b77f17 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -453,7 +453,7 @@ config DM_INIT Enable "dm-mod.create=" parameter to create mapped devices at init time. This option is useful to allow mounting rootfs without requiring an initramfs. - See Documentation/device-mapper/dm-init.txt for dm-mod.create="..." + See Documentation/device-mapper/dm-init.rst for dm-mod.create="..." format. If unsure, say N. diff --git a/drivers/md/dm-init.c b/drivers/md/dm-init.c index 352e803f566e..a58d0944f592 100644 --- a/drivers/md/dm-init.c +++ b/drivers/md/dm-init.c @@ -25,7 +25,7 @@ static char *create; * Format: dm-mod.create=,,,,[,+][;,,,,[,+]+] * Table format: * - * See Documentation/device-mapper/dm-init.txt for dm-mod.create="..." format + * See Documentation/device-mapper/dm-init.rst for dm-mod.create="..." format * details. */ diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 9fdef6897316..7a87a640f8ba 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -3558,7 +3558,7 @@ static void raid_status(struct dm_target *ti, status_type_t type, * v1.5.0+: * * Sync action: - * See Documentation/device-mapper/dm-raid.txt for + * See Documentation/device-mapper/dm-raid.rst for * information on each of these states. */ DMEMIT(" %s", sync_action); -- cgit v1.2.3 From 10ffebbed5503b1830c7920ef528075785351be6 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 12 Jun 2019 14:52:44 -0300 Subject: docs: fault-injection: convert docs to ReST and rename to *.rst The conversion is actually: - add blank lines and indentation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab Acked-by: Federico Vaga Signed-off-by: Jonathan Corbet --- Documentation/fault-injection/fault-injection.rst | 446 +++++++++++++++++++++ Documentation/fault-injection/fault-injection.txt | 435 -------------------- Documentation/fault-injection/index.rst | 20 + .../fault-injection/notifier-error-inject.rst | 98 +++++ .../fault-injection/notifier-error-inject.txt | 94 ----- .../fault-injection/nvme-fault-injection.rst | 120 ++++++ .../fault-injection/nvme-fault-injection.txt | 116 ------ Documentation/fault-injection/provoke-crashes.rst | 48 +++ Documentation/fault-injection/provoke-crashes.txt | 38 -- Documentation/process/4.Coding.rst | 2 +- .../translations/it_IT/process/4.Coding.rst | 2 +- .../translations/zh_CN/process/4.Coding.rst | 2 +- drivers/misc/lkdtm/core.c | 2 +- include/linux/fault-inject.h | 2 +- lib/Kconfig.debug | 2 +- tools/testing/fault-injection/failcmd.sh | 2 +- 16 files changed, 739 insertions(+), 690 deletions(-) create mode 100644 Documentation/fault-injection/fault-injection.rst delete mode 100644 Documentation/fault-injection/fault-injection.txt create mode 100644 Documentation/fault-injection/index.rst create mode 100644 Documentation/fault-injection/notifier-error-inject.rst delete mode 100644 Documentation/fault-injection/notifier-error-inject.txt create mode 100644 Documentation/fault-injection/nvme-fault-injection.rst delete mode 100644 Documentation/fault-injection/nvme-fault-injection.txt create mode 100644 Documentation/fault-injection/provoke-crashes.rst delete mode 100644 Documentation/fault-injection/provoke-crashes.txt (limited to 'drivers') diff --git a/Documentation/fault-injection/fault-injection.rst b/Documentation/fault-injection/fault-injection.rst new file mode 100644 index 000000000000..f51bb21d20e4 --- /dev/null +++ b/Documentation/fault-injection/fault-injection.rst @@ -0,0 +1,446 @@ +=========================================== +Fault injection capabilities infrastructure 
+=========================================== + +See also drivers/md/md-faulty.c and "every_nth" module option for scsi_debug. + + +Available fault injection capabilities +-------------------------------------- + +- failslab + + injects slab allocation failures. (kmalloc(), kmem_cache_alloc(), ...) + +- fail_page_alloc + + injects page allocation failures. (alloc_pages(), get_free_pages(), ...) + +- fail_futex + + injects futex deadlock and uaddr fault errors. + +- fail_make_request + + injects disk IO errors on devices permitted by setting + /sys/block//make-it-fail or + /sys/block///make-it-fail. (generic_make_request()) + +- fail_mmc_request + + injects MMC data errors on devices permitted by setting + debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request + +- fail_function + + injects error return on specific functions, which are marked by + ALLOW_ERROR_INJECTION() macro, by setting debugfs entries + under /sys/kernel/debug/fail_function. No boot option supported. + +- NVMe fault injection + + inject NVMe status code and retry flag on devices permitted by setting + debugfs entries under /sys/kernel/debug/nvme*/fault_inject. The default + status code is NVME_SC_INVALID_OPCODE with no retry. The status code and + retry flag can be set via the debugfs. + + +Configure fault-injection capabilities behavior +----------------------------------------------- + +debugfs entries +^^^^^^^^^^^^^^^ + +fault-inject-debugfs kernel module provides some debugfs entries for runtime +configuration of fault-injection capabilities. + +- /sys/kernel/debug/fail*/probability: + + likelihood of failure injection, in percent. + + Format: + + Note that one-failure-per-hundred is a very high error rate + for some testcases. Consider setting probability=100 and configure + /sys/kernel/debug/fail*/interval for such testcases. + +- /sys/kernel/debug/fail*/interval: + + specifies the interval between failures, for calls to + should_fail() that pass all the other tests. 
+ + Note that if you enable this, by setting interval>1, you will + probably want to set probability=100. + +- /sys/kernel/debug/fail*/times: + + specifies how many times failures may happen at most. + A value of -1 means "no limit". + +- /sys/kernel/debug/fail*/space: + + specifies an initial resource "budget", decremented by "size" + on each call to should_fail(,size). Failure injection is + suppressed until "space" reaches zero. + +- /sys/kernel/debug/fail*/verbose + + Format: { 0 | 1 | 2 } + + specifies the verbosity of the messages when failure is + injected. '0' means no messages; '1' will print only a single + log line per failure; '2' will print a call trace too -- useful + to debug the problems revealed by fault injection. + +- /sys/kernel/debug/fail*/task-filter: + + Format: { 'Y' | 'N' } + + A value of 'N' disables filtering by process (default). + Any positive value limits failures to only processes indicated by + /proc//make-it-fail==1. + +- /sys/kernel/debug/fail*/require-start, + /sys/kernel/debug/fail*/require-end, + /sys/kernel/debug/fail*/reject-start, + /sys/kernel/debug/fail*/reject-end: + + specifies the range of virtual addresses tested during + stacktrace walking. Failure is injected only if some caller + in the walked stacktrace lies within the required range, and + none lies within the rejected range. + Default required range is [0,ULONG_MAX) (whole of virtual address space). + Default rejected range is [0,0). + +- /sys/kernel/debug/fail*/stacktrace-depth: + + specifies the maximum stacktrace depth walked during search + for a caller within [require-start,require-end) OR + [reject-start,reject-end). + +- /sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem: + + Format: { 'Y' | 'N' } + + default is 'N', setting it to 'Y' won't inject failures into + highmem/user allocations. 
+ +- /sys/kernel/debug/failslab/ignore-gfp-wait: +- /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait: + + Format: { 'Y' | 'N' } + + default is 'N', setting it to 'Y' will inject failures + only into non-sleep allocations (GFP_ATOMIC allocations). + +- /sys/kernel/debug/fail_page_alloc/min-order: + + specifies the minimum page allocation order to be injected + failures. + +- /sys/kernel/debug/fail_futex/ignore-private: + + Format: { 'Y' | 'N' } + + default is 'N', setting it to 'Y' will disable failure injections + when dealing with private (address space) futexes. + +- /sys/kernel/debug/fail_function/inject: + + Format: { 'function-name' | '!function-name' | '' } + + specifies the target function of error injection by name. + If the function name leads '!' prefix, given function is + removed from injection list. If nothing specified ('') + injection list is cleared. + +- /sys/kernel/debug/fail_function/injectable: + + (read only) shows error injectable functions and what type of + error values can be specified. The error type will be one of + below; + - NULL: retval must be 0. + - ERRNO: retval must be -1 to -MAX_ERRNO (-4096). + - ERR_NULL: retval must be 0 or -1 to -MAX_ERRNO (-4096). + +- /sys/kernel/debug/fail_function//retval: + + specifies the "error" return value to inject to the given + function for given function. This will be created when + user specifies new injection entry. + +Boot option +^^^^^^^^^^^ + +In order to inject faults while debugfs is not available (early boot time), +use the boot option:: + + failslab= + fail_page_alloc= + fail_make_request= + fail_futex= + mmc_core.fail_request=,,, + +proc entries +^^^^^^^^^^^^ + +- /proc//fail-nth, + /proc/self/task//fail-nth: + + Write to this file of integer N makes N-th call in the task fail. + Read from this file returns a integer value. A value of '0' indicates + that the fault setup with a previous write to this file was injected. + A positive integer N indicates that the fault wasn't yet injected. 
+ Note that this file enables all types of faults (slab, futex, etc). + This setting takes precedence over all other generic debugfs settings + like probability, interval, times, etc. But per-capability settings + (e.g. fail_futex/ignore-private) take precedence over it. + + This feature is intended for systematic testing of faults in a single + system call. See an example below. + +How to add new fault injection capability +----------------------------------------- + +- #include + +- define the fault attributes + + DECLARE_FAULT_ATTR(name); + + Please see the definition of struct fault_attr in fault-inject.h + for details. + +- provide a way to configure fault attributes + +- boot option + + If you need to enable the fault injection capability from boot time, you can + provide boot option to configure it. There is a helper function for it: + + setup_fault_attr(attr, str); + +- debugfs entries + + failslab, fail_page_alloc, and fail_make_request use this way. + Helper functions: + + fault_create_debugfs_attr(name, parent, attr); + +- module parameters + + If the scope of the fault injection capability is limited to a + single kernel module, it is better to provide module parameters to + configure the fault attributes. 
+ +- add a hook to insert failures + + Upon should_fail() returning true, client code should inject a failure: + + should_fail(attr, size); + +Application Examples +-------------------- + +- Inject slab allocation failures into module init/exit code:: + + #!/bin/bash + + FAILTYPE=failslab + echo Y > /sys/kernel/debug/$FAILTYPE/task-filter + echo 10 > /sys/kernel/debug/$FAILTYPE/probability + echo 100 > /sys/kernel/debug/$FAILTYPE/interval + echo -1 > /sys/kernel/debug/$FAILTYPE/times + echo 0 > /sys/kernel/debug/$FAILTYPE/space + echo 2 > /sys/kernel/debug/$FAILTYPE/verbose + echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait + + faulty_system() + { + bash -c "echo 1 > /proc/self/make-it-fail && exec $*" + } + + if [ $# -eq 0 ] + then + echo "Usage: $0 modulename [ modulename ... ]" + exit 1 + fi + + for m in $* + do + echo inserting $m... + faulty_system modprobe $m + + echo removing $m... + faulty_system modprobe -r $m + done + +------------------------------------------------------------------------------ + +- Inject page allocation failures only for a specific module:: + + #!/bin/bash + + FAILTYPE=fail_page_alloc + module=$1 + + if [ -z $module ] + then + echo "Usage: $0 " + exit 1 + fi + + modprobe $module + + if [ ! 
-d /sys/module/$module/sections ] + then + echo Module $module is not loaded + exit 1 + fi + + cat /sys/module/$module/sections/.text > /sys/kernel/debug/$FAILTYPE/require-start + cat /sys/module/$module/sections/.data > /sys/kernel/debug/$FAILTYPE/require-end + + echo N > /sys/kernel/debug/$FAILTYPE/task-filter + echo 10 > /sys/kernel/debug/$FAILTYPE/probability + echo 100 > /sys/kernel/debug/$FAILTYPE/interval + echo -1 > /sys/kernel/debug/$FAILTYPE/times + echo 0 > /sys/kernel/debug/$FAILTYPE/space + echo 2 > /sys/kernel/debug/$FAILTYPE/verbose + echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait + echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-highmem + echo 10 > /sys/kernel/debug/$FAILTYPE/stacktrace-depth + + trap "echo 0 > /sys/kernel/debug/$FAILTYPE/probability" SIGINT SIGTERM EXIT + + echo "Injecting errors into the module $module... (interrupt to stop)" + sleep 1000000 + +------------------------------------------------------------------------------ + +- Inject open_ctree error while btrfs mount:: + + #!/bin/bash + + rm -f testfile.img + dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1 + DEVICE=$(losetup --show -f testfile.img) + mkfs.btrfs -f $DEVICE + mkdir -p tmpmnt + + FAILTYPE=fail_function + FAILFUNC=open_ctree + echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject + echo -12 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval + echo N > /sys/kernel/debug/$FAILTYPE/task-filter + echo 100 > /sys/kernel/debug/$FAILTYPE/probability + echo 0 > /sys/kernel/debug/$FAILTYPE/interval + echo -1 > /sys/kernel/debug/$FAILTYPE/times + echo 0 > /sys/kernel/debug/$FAILTYPE/space + echo 1 > /sys/kernel/debug/$FAILTYPE/verbose + + mount -t btrfs $DEVICE tmpmnt + if [ $? -ne 0 ] + then + echo "SUCCESS!" + else + echo "FAILED!" 
+ umount tmpmnt + fi + + echo > /sys/kernel/debug/$FAILTYPE/inject + + rmdir tmpmnt + losetup -d $DEVICE + rm testfile.img + + +Tool to run command with failslab or fail_page_alloc +---------------------------------------------------- +In order to make it easier to accomplish the tasks mentioned above, we can use +tools/testing/fault-injection/failcmd.sh. Please run a command +"./tools/testing/fault-injection/failcmd.sh --help" for more information and +see the following examples. + +Examples: + +Run a command "make -C tools/testing/selftests/ run_tests" with injecting slab +allocation failure:: + + # ./tools/testing/fault-injection/failcmd.sh \ + -- make -C tools/testing/selftests/ run_tests + +Same as above except to specify 100 times failures at most instead of one time +at most by default:: + + # ./tools/testing/fault-injection/failcmd.sh --times=100 \ + -- make -C tools/testing/selftests/ run_tests + +Same as above except to inject page allocation failure instead of slab +allocation failure:: + + # env FAILCMD_TYPE=fail_page_alloc \ + ./tools/testing/fault-injection/failcmd.sh --times=100 \ + -- make -C tools/testing/selftests/ run_tests + +Systematic faults using fail-nth +--------------------------------- + +The following code systematically faults 0-th, 1-st, 2-nd and so on +capabilities in the socketpair() system call:: + + #include + #include + #include + #include + #include + #include + #include + #include + #include + #include + + int main() + { + int i, err, res, fail_nth, fds[2]; + char buf[128]; + + system("echo N > /sys/kernel/debug/failslab/ignore-gfp-wait"); + sprintf(buf, "/proc/self/task/%ld/fail-nth", syscall(SYS_gettid)); + fail_nth = open(buf, O_RDWR); + for (i = 1;; i++) { + sprintf(buf, "%d", i); + write(fail_nth, buf, strlen(buf)); + res = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds); + err = errno; + pread(fail_nth, buf, sizeof(buf), 0); + if (res == 0) { + close(fds[0]); + close(fds[1]); + } + printf("%d-th fault %c: res=%d/%d\n", i, 
atoi(buf) ? 'N' : 'Y', + res, err); + if (atoi(buf)) + break; + } + return 0; + } + +An example output:: + + 1-th fault Y: res=-1/23 + 2-th fault Y: res=-1/23 + 3-th fault Y: res=-1/12 + 4-th fault Y: res=-1/12 + 5-th fault Y: res=-1/23 + 6-th fault Y: res=-1/23 + 7-th fault Y: res=-1/23 + 8-th fault Y: res=-1/12 + 9-th fault Y: res=-1/12 + 10-th fault Y: res=-1/12 + 11-th fault Y: res=-1/12 + 12-th fault Y: res=-1/12 + 13-th fault Y: res=-1/12 + 14-th fault Y: res=-1/12 + 15-th fault Y: res=-1/12 + 16-th fault N: res=0/12 diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt deleted file mode 100644 index a17517a083c3..000000000000 --- a/Documentation/fault-injection/fault-injection.txt +++ /dev/null @@ -1,435 +0,0 @@ -Fault injection capabilities infrastructure -=========================================== - -See also drivers/md/md-faulty.c and "every_nth" module option for scsi_debug. - - -Available fault injection capabilities --------------------------------------- - -o failslab - - injects slab allocation failures. (kmalloc(), kmem_cache_alloc(), ...) - -o fail_page_alloc - - injects page allocation failures. (alloc_pages(), get_free_pages(), ...) - -o fail_futex - - injects futex deadlock and uaddr fault errors. - -o fail_make_request - - injects disk IO errors on devices permitted by setting - /sys/block//make-it-fail or - /sys/block///make-it-fail. (generic_make_request()) - -o fail_mmc_request - - injects MMC data errors on devices permitted by setting - debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request - -o fail_function - - injects error return on specific functions, which are marked by - ALLOW_ERROR_INJECTION() macro, by setting debugfs entries - under /sys/kernel/debug/fail_function. No boot option supported. - -o NVMe fault injection - - inject NVMe status code and retry flag on devices permitted by setting - debugfs entries under /sys/kernel/debug/nvme*/fault_inject. 
The default - status code is NVME_SC_INVALID_OPCODE with no retry. The status code and - retry flag can be set via the debugfs. - - -Configure fault-injection capabilities behavior ------------------------------------------------ - -o debugfs entries - -fault-inject-debugfs kernel module provides some debugfs entries for runtime -configuration of fault-injection capabilities. - -- /sys/kernel/debug/fail*/probability: - - likelihood of failure injection, in percent. - Format: - - Note that one-failure-per-hundred is a very high error rate - for some testcases. Consider setting probability=100 and configure - /sys/kernel/debug/fail*/interval for such testcases. - -- /sys/kernel/debug/fail*/interval: - - specifies the interval between failures, for calls to - should_fail() that pass all the other tests. - - Note that if you enable this, by setting interval>1, you will - probably want to set probability=100. - -- /sys/kernel/debug/fail*/times: - - specifies how many times failures may happen at most. - A value of -1 means "no limit". - -- /sys/kernel/debug/fail*/space: - - specifies an initial resource "budget", decremented by "size" - on each call to should_fail(,size). Failure injection is - suppressed until "space" reaches zero. - -- /sys/kernel/debug/fail*/verbose - - Format: { 0 | 1 | 2 } - specifies the verbosity of the messages when failure is - injected. '0' means no messages; '1' will print only a single - log line per failure; '2' will print a call trace too -- useful - to debug the problems revealed by fault injection. - -- /sys/kernel/debug/fail*/task-filter: - - Format: { 'Y' | 'N' } - A value of 'N' disables filtering by process (default). - Any positive value limits failures to only processes indicated by - /proc//make-it-fail==1. 
- -- /sys/kernel/debug/fail*/require-start: -- /sys/kernel/debug/fail*/require-end: -- /sys/kernel/debug/fail*/reject-start: -- /sys/kernel/debug/fail*/reject-end: - - specifies the range of virtual addresses tested during - stacktrace walking. Failure is injected only if some caller - in the walked stacktrace lies within the required range, and - none lies within the rejected range. - Default required range is [0,ULONG_MAX) (whole of virtual address space). - Default rejected range is [0,0). - -- /sys/kernel/debug/fail*/stacktrace-depth: - - specifies the maximum stacktrace depth walked during search - for a caller within [require-start,require-end) OR - [reject-start,reject-end). - -- /sys/kernel/debug/fail_page_alloc/ignore-gfp-highmem: - - Format: { 'Y' | 'N' } - default is 'N', setting it to 'Y' won't inject failures into - highmem/user allocations. - -- /sys/kernel/debug/failslab/ignore-gfp-wait: -- /sys/kernel/debug/fail_page_alloc/ignore-gfp-wait: - - Format: { 'Y' | 'N' } - default is 'N', setting it to 'Y' will inject failures - only into non-sleep allocations (GFP_ATOMIC allocations). - -- /sys/kernel/debug/fail_page_alloc/min-order: - - specifies the minimum page allocation order to be injected - failures. - -- /sys/kernel/debug/fail_futex/ignore-private: - - Format: { 'Y' | 'N' } - default is 'N', setting it to 'Y' will disable failure injections - when dealing with private (address space) futexes. - -- /sys/kernel/debug/fail_function/inject: - - Format: { 'function-name' | '!function-name' | '' } - specifies the target function of error injection by name. - If the function name leads '!' prefix, given function is - removed from injection list. If nothing specified ('') - injection list is cleared. - -- /sys/kernel/debug/fail_function/injectable: - - (read only) shows error injectable functions and what type of - error values can be specified. The error type will be one of - below; - - NULL: retval must be 0. 
- - ERRNO: retval must be -1 to -MAX_ERRNO (-4096). - - ERR_NULL: retval must be 0 or -1 to -MAX_ERRNO (-4096). - -- /sys/kernel/debug/fail_function//retval: - - specifies the "error" return value to inject to the given - function for given function. This will be created when - user specifies new injection entry. - -o Boot option - -In order to inject faults while debugfs is not available (early boot time), -use the boot option: - - failslab= - fail_page_alloc= - fail_make_request= - fail_futex= - mmc_core.fail_request=,,, - -o proc entries - -- /proc//fail-nth: -- /proc/self/task//fail-nth: - - Write to this file of integer N makes N-th call in the task fail. - Read from this file returns a integer value. A value of '0' indicates - that the fault setup with a previous write to this file was injected. - A positive integer N indicates that the fault wasn't yet injected. - Note that this file enables all types of faults (slab, futex, etc). - This setting takes precedence over all other generic debugfs settings - like probability, interval, times, etc. But per-capability settings - (e.g. fail_futex/ignore-private) take precedence over it. - - This feature is intended for systematic testing of faults in a single - system call. See an example below. - -How to add new fault injection capability ------------------------------------------ - -o #include - -o define the fault attributes - - DECLARE_FAULT_ATTR(name); - - Please see the definition of struct fault_attr in fault-inject.h - for details. - -o provide a way to configure fault attributes - -- boot option - - If you need to enable the fault injection capability from boot time, you can - provide boot option to configure it. There is a helper function for it: - - setup_fault_attr(attr, str); - -- debugfs entries - - failslab, fail_page_alloc, and fail_make_request use this way. 
- Helper functions: - - fault_create_debugfs_attr(name, parent, attr); - -- module parameters - - If the scope of the fault injection capability is limited to a - single kernel module, it is better to provide module parameters to - configure the fault attributes. - -o add a hook to insert failures - - Upon should_fail() returning true, client code should inject a failure. - - should_fail(attr, size); - -Application Examples --------------------- - -o Inject slab allocation failures into module init/exit code - -#!/bin/bash - -FAILTYPE=failslab -echo Y > /sys/kernel/debug/$FAILTYPE/task-filter -echo 10 > /sys/kernel/debug/$FAILTYPE/probability -echo 100 > /sys/kernel/debug/$FAILTYPE/interval -echo -1 > /sys/kernel/debug/$FAILTYPE/times -echo 0 > /sys/kernel/debug/$FAILTYPE/space -echo 2 > /sys/kernel/debug/$FAILTYPE/verbose -echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait - -faulty_system() -{ - bash -c "echo 1 > /proc/self/make-it-fail && exec $*" -} - -if [ $# -eq 0 ] -then - echo "Usage: $0 modulename [ modulename ... ]" - exit 1 -fi - -for m in $* -do - echo inserting $m... - faulty_system modprobe $m - - echo removing $m... - faulty_system modprobe -r $m -done - ------------------------------------------------------------------------------- - -o Inject page allocation failures only for a specific module - -#!/bin/bash - -FAILTYPE=fail_page_alloc -module=$1 - -if [ -z $module ] -then - echo "Usage: $0 " - exit 1 -fi - -modprobe $module - -if [ ! 
-d /sys/module/$module/sections ] -then - echo Module $module is not loaded - exit 1 -fi - -cat /sys/module/$module/sections/.text > /sys/kernel/debug/$FAILTYPE/require-start -cat /sys/module/$module/sections/.data > /sys/kernel/debug/$FAILTYPE/require-end - -echo N > /sys/kernel/debug/$FAILTYPE/task-filter -echo 10 > /sys/kernel/debug/$FAILTYPE/probability -echo 100 > /sys/kernel/debug/$FAILTYPE/interval -echo -1 > /sys/kernel/debug/$FAILTYPE/times -echo 0 > /sys/kernel/debug/$FAILTYPE/space -echo 2 > /sys/kernel/debug/$FAILTYPE/verbose -echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-wait -echo 1 > /sys/kernel/debug/$FAILTYPE/ignore-gfp-highmem -echo 10 > /sys/kernel/debug/$FAILTYPE/stacktrace-depth - -trap "echo 0 > /sys/kernel/debug/$FAILTYPE/probability" SIGINT SIGTERM EXIT - -echo "Injecting errors into the module $module... (interrupt to stop)" -sleep 1000000 - ------------------------------------------------------------------------------- - -o Inject open_ctree error while btrfs mount - -#!/bin/bash - -rm -f testfile.img -dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1 -DEVICE=$(losetup --show -f testfile.img) -mkfs.btrfs -f $DEVICE -mkdir -p tmpmnt - -FAILTYPE=fail_function -FAILFUNC=open_ctree -echo $FAILFUNC > /sys/kernel/debug/$FAILTYPE/inject -echo -12 > /sys/kernel/debug/$FAILTYPE/$FAILFUNC/retval -echo N > /sys/kernel/debug/$FAILTYPE/task-filter -echo 100 > /sys/kernel/debug/$FAILTYPE/probability -echo 0 > /sys/kernel/debug/$FAILTYPE/interval -echo -1 > /sys/kernel/debug/$FAILTYPE/times -echo 0 > /sys/kernel/debug/$FAILTYPE/space -echo 1 > /sys/kernel/debug/$FAILTYPE/verbose - -mount -t btrfs $DEVICE tmpmnt -if [ $? -ne 0 ] -then - echo "SUCCESS!" -else - echo "FAILED!" 
- umount tmpmnt -fi - -echo > /sys/kernel/debug/$FAILTYPE/inject - -rmdir tmpmnt -losetup -d $DEVICE -rm testfile.img - - -Tool to run command with failslab or fail_page_alloc ----------------------------------------------------- -In order to make it easier to accomplish the tasks mentioned above, we can use -tools/testing/fault-injection/failcmd.sh. Please run a command -"./tools/testing/fault-injection/failcmd.sh --help" for more information and -see the following examples. - -Examples: - -Run a command "make -C tools/testing/selftests/ run_tests" with injecting slab -allocation failure. - - # ./tools/testing/fault-injection/failcmd.sh \ - -- make -C tools/testing/selftests/ run_tests - -Same as above except to specify 100 times failures at most instead of one time -at most by default. - - # ./tools/testing/fault-injection/failcmd.sh --times=100 \ - -- make -C tools/testing/selftests/ run_tests - -Same as above except to inject page allocation failure instead of slab -allocation failure. - - # env FAILCMD_TYPE=fail_page_alloc \ - ./tools/testing/fault-injection/failcmd.sh --times=100 \ - -- make -C tools/testing/selftests/ run_tests - -Systematic faults using fail-nth ---------------------------------- - -The following code systematically faults 0-th, 1-st, 2-nd and so on -capabilities in the socketpair() system call. - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -int main() -{ - int i, err, res, fail_nth, fds[2]; - char buf[128]; - - system("echo N > /sys/kernel/debug/failslab/ignore-gfp-wait"); - sprintf(buf, "/proc/self/task/%ld/fail-nth", syscall(SYS_gettid)); - fail_nth = open(buf, O_RDWR); - for (i = 1;; i++) { - sprintf(buf, "%d", i); - write(fail_nth, buf, strlen(buf)); - res = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds); - err = errno; - pread(fail_nth, buf, sizeof(buf), 0); - if (res == 0) { - close(fds[0]); - close(fds[1]); - } - printf("%d-th fault %c: res=%d/%d\n", i, atoi(buf) ? 
'N' : 'Y', - res, err); - if (atoi(buf)) - break; - } - return 0; -} - -An example output: - -1-th fault Y: res=-1/23 -2-th fault Y: res=-1/23 -3-th fault Y: res=-1/12 -4-th fault Y: res=-1/12 -5-th fault Y: res=-1/23 -6-th fault Y: res=-1/23 -7-th fault Y: res=-1/23 -8-th fault Y: res=-1/12 -9-th fault Y: res=-1/12 -10-th fault Y: res=-1/12 -11-th fault Y: res=-1/12 -12-th fault Y: res=-1/12 -13-th fault Y: res=-1/12 -14-th fault Y: res=-1/12 -15-th fault Y: res=-1/12 -16-th fault N: res=0/12 diff --git a/Documentation/fault-injection/index.rst b/Documentation/fault-injection/index.rst new file mode 100644 index 000000000000..92b5639ed07a --- /dev/null +++ b/Documentation/fault-injection/index.rst @@ -0,0 +1,20 @@ +:orphan: + +=============== +fault-injection +=============== + +.. toctree:: + :maxdepth: 1 + + fault-injection + notifier-error-inject + nvme-fault-injection + provoke-crashes + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/fault-injection/notifier-error-inject.rst b/Documentation/fault-injection/notifier-error-inject.rst new file mode 100644 index 000000000000..1668b6e48d3a --- /dev/null +++ b/Documentation/fault-injection/notifier-error-inject.rst @@ -0,0 +1,98 @@ +Notifier error injection +======================== + +Notifier error injection provides the ability to inject artificial errors to +specified notifier chain callbacks. It is useful to test the error handling of +notifier call chain failures which is rarely executed. There are kernel +modules that can be used to test the following notifiers. 
+ + * PM notifier + * Memory hotplug notifier + * powerpc pSeries reconfig notifier + * Netdevice notifier + +PM notifier error injection module +---------------------------------- +This feature is controlled through debugfs interface + + /sys/kernel/debug/notifier-error-inject/pm/actions//error + +Possible PM notifier events to be failed are: + + * PM_HIBERNATION_PREPARE + * PM_SUSPEND_PREPARE + * PM_RESTORE_PREPARE + +Example: Inject PM suspend error (-12 = -ENOMEM):: + + # cd /sys/kernel/debug/notifier-error-inject/pm/ + # echo -12 > actions/PM_SUSPEND_PREPARE/error + # echo mem > /sys/power/state + bash: echo: write error: Cannot allocate memory + +Memory hotplug notifier error injection module +---------------------------------------------- +This feature is controlled through debugfs interface + + /sys/kernel/debug/notifier-error-inject/memory/actions//error + +Possible memory notifier events to be failed are: + + * MEM_GOING_ONLINE + * MEM_GOING_OFFLINE + +Example: Inject memory hotplug offline error (-12 == -ENOMEM):: + + # cd /sys/kernel/debug/notifier-error-inject/memory + # echo -12 > actions/MEM_GOING_OFFLINE/error + # echo offline > /sys/devices/system/memory/memoryXXX/state + bash: echo: write error: Cannot allocate memory + +powerpc pSeries reconfig notifier error injection module +-------------------------------------------------------- +This feature is controlled through debugfs interface + + /sys/kernel/debug/notifier-error-inject/pSeries-reconfig/actions//error + +Possible pSeries reconfig notifier events to be failed are: + + * PSERIES_RECONFIG_ADD + * PSERIES_RECONFIG_REMOVE + * PSERIES_DRCONF_MEM_ADD + * PSERIES_DRCONF_MEM_REMOVE + +Netdevice notifier error injection module +---------------------------------------------- +This feature is controlled through debugfs interface + + /sys/kernel/debug/notifier-error-inject/netdev/actions//error + +Netdevice notifier events which can be failed are: + + * NETDEV_REGISTER + * NETDEV_CHANGEMTU + * 
NETDEV_CHANGENAME + * NETDEV_PRE_UP + * NETDEV_PRE_TYPE_CHANGE + * NETDEV_POST_INIT + * NETDEV_PRECHANGEMTU + * NETDEV_PRECHANGEUPPER + * NETDEV_CHANGEUPPER + +Example: Inject netdevice mtu change error (-22 == -EINVAL):: + + # cd /sys/kernel/debug/notifier-error-inject/netdev + # echo -22 > actions/NETDEV_CHANGEMTU/error + # ip link set eth0 mtu 1024 + RTNETLINK answers: Invalid argument + +For more usage examples +----------------------- +There are tools/testing/selftests using the notifier error injection features +for CPU and memory notifiers. + + * tools/testing/selftests/cpu-hotplug/on-off-test.sh + * tools/testing/selftests/memory-hotplug/on-off-test.sh + +These scripts first do simple online and offline tests and then do fault +injection tests if notifier error injection module is available. diff --git a/Documentation/fault-injection/notifier-error-inject.txt b/Documentation/fault-injection/notifier-error-inject.txt deleted file mode 100644 index e861d761de24..000000000000 --- a/Documentation/fault-injection/notifier-error-inject.txt +++ /dev/null @@ -1,94 +0,0 @@ -Notifier error injection -======================== - -Notifier error injection provides the ability to inject artificial errors to -specified notifier chain callbacks. It is useful to test the error handling of -notifier call chain failures which is rarely executed. There are kernel -modules that can be used to test the following notifiers. 
- - * PM notifier - * Memory hotplug notifier - * powerpc pSeries reconfig notifier - * Netdevice notifier - -PM notifier error injection module ----------------------------------- -This feature is controlled through debugfs interface -/sys/kernel/debug/notifier-error-inject/pm/actions//error - -Possible PM notifier events to be failed are: - - * PM_HIBERNATION_PREPARE - * PM_SUSPEND_PREPARE - * PM_RESTORE_PREPARE - -Example: Inject PM suspend error (-12 = -ENOMEM) - - # cd /sys/kernel/debug/notifier-error-inject/pm/ - # echo -12 > actions/PM_SUSPEND_PREPARE/error - # echo mem > /sys/power/state - bash: echo: write error: Cannot allocate memory - -Memory hotplug notifier error injection module ----------------------------------------------- -This feature is controlled through debugfs interface -/sys/kernel/debug/notifier-error-inject/memory/actions//error - -Possible memory notifier events to be failed are: - - * MEM_GOING_ONLINE - * MEM_GOING_OFFLINE - -Example: Inject memory hotplug offline error (-12 == -ENOMEM) - - # cd /sys/kernel/debug/notifier-error-inject/memory - # echo -12 > actions/MEM_GOING_OFFLINE/error - # echo offline > /sys/devices/system/memory/memoryXXX/state - bash: echo: write error: Cannot allocate memory - -powerpc pSeries reconfig notifier error injection module --------------------------------------------------------- -This feature is controlled through debugfs interface -/sys/kernel/debug/notifier-error-inject/pSeries-reconfig/actions//error - -Possible pSeries reconfig notifier events to be failed are: - - * PSERIES_RECONFIG_ADD - * PSERIES_RECONFIG_REMOVE - * PSERIES_DRCONF_MEM_ADD - * PSERIES_DRCONF_MEM_REMOVE - -Netdevice notifier error injection module ----------------------------------------------- -This feature is controlled through debugfs interface -/sys/kernel/debug/notifier-error-inject/netdev/actions//error - -Netdevice notifier events which can be failed are: - - * NETDEV_REGISTER - * NETDEV_CHANGEMTU - * NETDEV_CHANGENAME - * 
NETDEV_PRE_UP - * NETDEV_PRE_TYPE_CHANGE - * NETDEV_POST_INIT - * NETDEV_PRECHANGEMTU - * NETDEV_PRECHANGEUPPER - * NETDEV_CHANGEUPPER - -Example: Inject netdevice mtu change error (-22 == -EINVAL) - - # cd /sys/kernel/debug/notifier-error-inject/netdev - # echo -22 > actions/NETDEV_CHANGEMTU/error - # ip link set eth0 mtu 1024 - RTNETLINK answers: Invalid argument - -For more usage examples ------------------------ -There are tools/testing/selftests using the notifier error injection features -for CPU and memory notifiers. - - * tools/testing/selftests/cpu-hotplug/on-off-test.sh - * tools/testing/selftests/memory-hotplug/on-off-test.sh - -These scripts first do simple online and offline tests and then do fault -injection tests if notifier error injection module is available. diff --git a/Documentation/fault-injection/nvme-fault-injection.rst b/Documentation/fault-injection/nvme-fault-injection.rst new file mode 100644 index 000000000000..bbb1bf3e8650 --- /dev/null +++ b/Documentation/fault-injection/nvme-fault-injection.rst @@ -0,0 +1,120 @@ +NVMe Fault Injection +==================== +Linux's fault injection framework provides a systematic way to support +error injection via debugfs in the /sys/kernel/debug directory. When +enabled, the default NVME_SC_INVALID_OPCODE with no retry will be +injected into the nvme_end_request. Users can change the default status +code and no retry flag via the debugfs. The list of Generic Command +Status can be found in include/linux/nvme.h + +Following examples show how to inject an error into the nvme. + +First, enable CONFIG_FAULT_INJECTION_DEBUG_FS kernel config, +recompile the kernel. After booting up the kernel, do the +following. 
+ +Example 1: Inject default status code with no retry +--------------------------------------------------- + +:: + + mount /dev/nvme0n1 /mnt + echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times + echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability + cp a.file /mnt + +Expected Result:: + + cp: cannot stat ‘/mnt/a.file’: Input/output error + +Message from dmesg:: + + FAULT_INJECTION: forcing a failure. + name fault_inject, interval 1, probability 100, space 0, times 1 + CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2 + Hardware name: innotek GmbH VirtualBox/VirtualBox, + BIOS VirtualBox 12/01/2006 + Call Trace: + + dump_stack+0x5c/0x7d + should_fail+0x148/0x170 + nvme_should_fail+0x2f/0x50 [nvme_core] + nvme_process_cq+0xe7/0x1d0 [nvme] + nvme_irq+0x1e/0x40 [nvme] + __handle_irq_event_percpu+0x3a/0x190 + handle_irq_event_percpu+0x30/0x70 + handle_irq_event+0x36/0x60 + handle_fasteoi_irq+0x78/0x120 + handle_irq+0xa7/0x130 + ? tick_irq_enter+0xa8/0xc0 + do_IRQ+0x43/0xc0 + common_interrupt+0xa2/0xa2 + + RIP: 0010:native_safe_halt+0x2/0x10 + RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd + RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000 + RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 + RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000 + R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480 + R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000 + ? __sched_text_end+0x4/0x4 + default_idle+0x18/0xf0 + do_idle+0x150/0x1d0 + cpu_startup_entry+0x6f/0x80 + start_kernel+0x4c4/0x4e4 + ? 
set_init_arg+0x55/0x55 + secondary_startup_64+0xa5/0xb0 + print_req_error: I/O error, dev nvme0n1, sector 9240 + EXT4-fs error (device nvme0n1): ext4_find_entry:1436: + inode #2: comm cp: reading directory lblock 0 + +Example 2: Inject default status code with retry +------------------------------------------------ + +:: + + mount /dev/nvme0n1 /mnt + echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times + echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability + echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/status + echo 0 > /sys/kernel/debug/nvme0n1/fault_inject/dont_retry + + cp a.file /mnt + +Expected Result:: + + command success without error + +Message from dmesg:: + + FAULT_INJECTION: forcing a failure. + name fault_inject, interval 1, probability 100, space 0, times 1 + CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc8+ #4 + Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 + Call Trace: + + dump_stack+0x5c/0x7d + should_fail+0x148/0x170 + nvme_should_fail+0x30/0x60 [nvme_core] + nvme_loop_queue_response+0x84/0x110 [nvme_loop] + nvmet_req_complete+0x11/0x40 [nvmet] + nvmet_bio_done+0x28/0x40 [nvmet] + blk_update_request+0xb0/0x310 + blk_mq_end_request+0x18/0x60 + flush_smp_call_function_queue+0x3d/0xf0 + smp_call_function_single_interrupt+0x2c/0xc0 + call_function_single_interrupt+0xa2/0xb0 + + RIP: 0010:native_safe_halt+0x2/0x10 + RSP: 0018:ffffc9000068bec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04 + RAX: ffffffff817a10c0 RBX: ffff88011a3c9680 RCX: 0000000000000000 + RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 + RBP: 0000000000000001 R08: 000000008e38c131 R09: 0000000000000000 + R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011a3c9680 + R13: ffff88011a3c9680 R14: 0000000000000000 R15: 0000000000000000 + ? 
__sched_text_end+0x4/0x4 + default_idle+0x18/0xf0 + do_idle+0x150/0x1d0 + cpu_startup_entry+0x6f/0x80 + start_secondary+0x187/0x1e0 + secondary_startup_64+0xa5/0xb0 diff --git a/Documentation/fault-injection/nvme-fault-injection.txt b/Documentation/fault-injection/nvme-fault-injection.txt deleted file mode 100644 index 8fbf3bf60b62..000000000000 --- a/Documentation/fault-injection/nvme-fault-injection.txt +++ /dev/null @@ -1,116 +0,0 @@ -NVMe Fault Injection -==================== -Linux's fault injection framework provides a systematic way to support -error injection via debugfs in the /sys/kernel/debug directory. When -enabled, the default NVME_SC_INVALID_OPCODE with no retry will be -injected into the nvme_end_request. Users can change the default status -code and no retry flag via the debugfs. The list of Generic Command -Status can be found in include/linux/nvme.h - -Following examples show how to inject an error into the nvme. - -First, enable CONFIG_FAULT_INJECTION_DEBUG_FS kernel config, -recompile the kernel. After booting up the kernel, do the -following. - -Example 1: Inject default status code with no retry ---------------------------------------------------- - -mount /dev/nvme0n1 /mnt -echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times -echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability -cp a.file /mnt - -Expected Result: - -cp: cannot stat ‘/mnt/a.file’: Input/output error - -Message from dmesg: - -FAULT_INJECTION: forcing a failure. 
-name fault_inject, interval 1, probability 100, space 0, times 1 -CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2 -Hardware name: innotek GmbH VirtualBox/VirtualBox, -BIOS VirtualBox 12/01/2006 -Call Trace: - - dump_stack+0x5c/0x7d - should_fail+0x148/0x170 - nvme_should_fail+0x2f/0x50 [nvme_core] - nvme_process_cq+0xe7/0x1d0 [nvme] - nvme_irq+0x1e/0x40 [nvme] - __handle_irq_event_percpu+0x3a/0x190 - handle_irq_event_percpu+0x30/0x70 - handle_irq_event+0x36/0x60 - handle_fasteoi_irq+0x78/0x120 - handle_irq+0xa7/0x130 - ? tick_irq_enter+0xa8/0xc0 - do_IRQ+0x43/0xc0 - common_interrupt+0xa2/0xa2 - -RIP: 0010:native_safe_halt+0x2/0x10 -RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd -RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000 -RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 -RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000 -R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480 -R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000 - ? __sched_text_end+0x4/0x4 - default_idle+0x18/0xf0 - do_idle+0x150/0x1d0 - cpu_startup_entry+0x6f/0x80 - start_kernel+0x4c4/0x4e4 - ? set_init_arg+0x55/0x55 - secondary_startup_64+0xa5/0xb0 - print_req_error: I/O error, dev nvme0n1, sector 9240 -EXT4-fs error (device nvme0n1): ext4_find_entry:1436: -inode #2: comm cp: reading directory lblock 0 - -Example 2: Inject default status code with retry ------------------------------------------------- - -mount /dev/nvme0n1 /mnt -echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times -echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability -echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/status -echo 0 > /sys/kernel/debug/nvme0n1/fault_inject/dont_retry - -cp a.file /mnt - -Expected Result: - -command success without error - -Message from dmesg: - -FAULT_INJECTION: forcing a failure. 
-name fault_inject, interval 1, probability 100, space 0, times 1 -CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc8+ #4 -Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 -Call Trace: - - dump_stack+0x5c/0x7d - should_fail+0x148/0x170 - nvme_should_fail+0x30/0x60 [nvme_core] - nvme_loop_queue_response+0x84/0x110 [nvme_loop] - nvmet_req_complete+0x11/0x40 [nvmet] - nvmet_bio_done+0x28/0x40 [nvmet] - blk_update_request+0xb0/0x310 - blk_mq_end_request+0x18/0x60 - flush_smp_call_function_queue+0x3d/0xf0 - smp_call_function_single_interrupt+0x2c/0xc0 - call_function_single_interrupt+0xa2/0xb0 - -RIP: 0010:native_safe_halt+0x2/0x10 -RSP: 0018:ffffc9000068bec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04 -RAX: ffffffff817a10c0 RBX: ffff88011a3c9680 RCX: 0000000000000000 -RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 -RBP: 0000000000000001 R08: 000000008e38c131 R09: 0000000000000000 -R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011a3c9680 -R13: ffff88011a3c9680 R14: 0000000000000000 R15: 0000000000000000 - ? __sched_text_end+0x4/0x4 - default_idle+0x18/0xf0 - do_idle+0x150/0x1d0 - cpu_startup_entry+0x6f/0x80 - start_secondary+0x187/0x1e0 - secondary_startup_64+0xa5/0xb0 diff --git a/Documentation/fault-injection/provoke-crashes.rst b/Documentation/fault-injection/provoke-crashes.rst new file mode 100644 index 000000000000..9279a3e12278 --- /dev/null +++ b/Documentation/fault-injection/provoke-crashes.rst @@ -0,0 +1,48 @@ +=============== +Provoke crashes +=============== + +The lkdtm module provides an interface to crash or injure the kernel at +predefined crashpoints to evaluate the reliability of crash dumps obtained +using different dumping solutions. The module uses KPROBEs to instrument +crashing points, but can also crash the kernel directly without KRPOBE +support. + + +You can provide the way either through module arguments when inserting +the module, or through a debugfs interface. 
+ +Usage:: + + insmod lkdtm.ko [recur_count={>0}] cpoint_name=<> cpoint_type=<> + [cpoint_count={>0}] + +recur_count + Recursion level for the stack overflow test. Default is 10. + +cpoint_name + Crash point where the kernel is to be crashed. It can be + one of INT_HARDWARE_ENTRY, INT_HW_IRQ_EN, INT_TASKLET_ENTRY, + FS_DEVRW, MEM_SWAPOUT, TIMERADD, SCSI_DISPATCH_CMD, + IDE_CORE_CP, DIRECT + +cpoint_type + Indicates the action to be taken on hitting the crash point. + It can be one of PANIC, BUG, EXCEPTION, LOOP, OVERFLOW, + CORRUPT_STACK, UNALIGNED_LOAD_STORE_WRITE, OVERWRITE_ALLOCATION, + WRITE_AFTER_FREE, + +cpoint_count + Indicates the number of times the crash point is to be hit + to trigger an action. The default is 10. + +You can also induce failures by mounting debugfs and writing the type to +/provoke-crash/. E.g.:: + + mount -t debugfs debugfs /mnt + echo EXCEPTION > /mnt/provoke-crash/INT_HARDWARE_ENTRY + + +A special file is `DIRECT` which will induce the crash directly without +KPROBE instrumentation. This mode is the only one available when the module +is built on a kernel without KPROBEs support. diff --git a/Documentation/fault-injection/provoke-crashes.txt b/Documentation/fault-injection/provoke-crashes.txt deleted file mode 100644 index 7a9d3d81525b..000000000000 --- a/Documentation/fault-injection/provoke-crashes.txt +++ /dev/null @@ -1,38 +0,0 @@ -The lkdtm module provides an interface to crash or injure the kernel at -predefined crashpoints to evaluate the reliability of crash dumps obtained -using different dumping solutions. The module uses KPROBEs to instrument -crashing points, but can also crash the kernel directly without KRPOBE -support. - - -You can provide the way either through module arguments when inserting -the module, or through a debugfs interface. - -Usage: insmod lkdtm.ko [recur_count={>0}] cpoint_name=<> cpoint_type=<> - [cpoint_count={>0}] - - recur_count : Recursion level for the stack overflow test. Default is 10. 
- - cpoint_name : Crash point where the kernel is to be crashed. It can be - one of INT_HARDWARE_ENTRY, INT_HW_IRQ_EN, INT_TASKLET_ENTRY, - FS_DEVRW, MEM_SWAPOUT, TIMERADD, SCSI_DISPATCH_CMD, - IDE_CORE_CP, DIRECT - - cpoint_type : Indicates the action to be taken on hitting the crash point. - It can be one of PANIC, BUG, EXCEPTION, LOOP, OVERFLOW, - CORRUPT_STACK, UNALIGNED_LOAD_STORE_WRITE, OVERWRITE_ALLOCATION, - WRITE_AFTER_FREE, - - cpoint_count : Indicates the number of times the crash point is to be hit - to trigger an action. The default is 10. - -You can also induce failures by mounting debugfs and writing the type to -/provoke-crash/. E.g., - - mount -t debugfs debugfs /mnt - echo EXCEPTION > /mnt/provoke-crash/INT_HARDWARE_ENTRY - - -A special file is `DIRECT' which will induce the crash directly without -KPROBE instrumentation. This mode is the only one available when the module -is built on a kernel without KPROBEs support. diff --git a/Documentation/process/4.Coding.rst b/Documentation/process/4.Coding.rst index 4b7a5ab3cec1..13dd893c9f88 100644 --- a/Documentation/process/4.Coding.rst +++ b/Documentation/process/4.Coding.rst @@ -298,7 +298,7 @@ enabled, a configurable percentage of memory allocations will be made to fail; these failures can be restricted to a specific range of code. Running with fault injection enabled allows the programmer to see how the code responds when things go badly. See -Documentation/fault-injection/fault-injection.txt for more information on +Documentation/fault-injection/fault-injection.rst for more information on how to use this facility. Other kinds of errors can be found with the "sparse" static analysis tool. 
diff --git a/Documentation/translations/it_IT/process/4.Coding.rst b/Documentation/translations/it_IT/process/4.Coding.rst index c05b89e616dd..a5e36aa60448 100644 --- a/Documentation/translations/it_IT/process/4.Coding.rst +++ b/Documentation/translations/it_IT/process/4.Coding.rst @@ -314,7 +314,7 @@ di allocazione di memoria sarà destinata al fallimento; questi fallimenti possono essere ridotti ad uno specifico pezzo di codice. Procedere con l'inserimento dei fallimenti attivo permette al programmatore di verificare come il codice risponde quando le cose vanno male. Consultate: -Documentation/fault-injection/fault-injection.txt per avere maggiori +Documentation/fault-injection/fault-injection.rst per avere maggiori informazioni su come utilizzare questo strumento. Altre tipologie di errori possono essere riscontrati con lo strumento di diff --git a/Documentation/translations/zh_CN/process/4.Coding.rst b/Documentation/translations/zh_CN/process/4.Coding.rst index 8bb777941394..b82b1dde3122 100644 --- a/Documentation/translations/zh_CN/process/4.Coding.rst +++ b/Documentation/translations/zh_CN/process/4.Coding.rst @@ -205,7 +205,7 @@ Linus对这个问题给出了最佳答案: 启用故障注入后,内存分配的可配置百分比将失败;这些失败可以限制在特定的代码 范围内。在启用了故障注入的情况下运行,程序员可以看到当情况恶化时代码如何响 应。有关如何使用此工具的详细信息,请参阅 -Documentation/fault-injection/fault-injection.txt。 +Documentation/fault-injection/fault-injection.rst。 使用“sparse”静态分析工具可以发现其他类型的错误。对于sparse,可以警告程序员 用户空间和内核空间地址之间的混淆、big endian和small endian数量的混合、在需 diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c index 8a1428d4f138..bba49abb6750 100644 --- a/drivers/misc/lkdtm/core.c +++ b/drivers/misc/lkdtm/core.c @@ -15,7 +15,7 @@ * * Debugfs support added by Simon Kagstrom * - * See Documentation/fault-injection/provoke-crashes.txt for instructions + * See Documentation/fault-injection/provoke-crashes.rst for instructions */ #include "lkdtm.h" #include diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h index 7e6c77740413..e525f6957c49 100644 --- 
a/include/linux/fault-inject.h +++ b/include/linux/fault-inject.h @@ -11,7 +11,7 @@ /* * For explanation of the elements of this struct, see - * Documentation/fault-injection/fault-injection.txt + * Documentation/fault-injection/fault-injection.rst */ struct fault_attr { unsigned long probability; diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index cbdfae379896..4d42a9a6006d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1701,7 +1701,7 @@ config LKDTM called lkdtm. Documentation on how to use the module can be found in - Documentation/fault-injection/provoke-crashes.txt + Documentation/fault-injection/provoke-crashes.rst config TEST_LIST_SORT tristate "Linked list sorting test" diff --git a/tools/testing/fault-injection/failcmd.sh b/tools/testing/fault-injection/failcmd.sh index 29a6c63c5a15..78dac34264be 100644 --- a/tools/testing/fault-injection/failcmd.sh +++ b/tools/testing/fault-injection/failcmd.sh @@ -42,7 +42,7 @@ OPTIONS --interval=value, --space=value, --verbose=value, --task-filter=value, --stacktrace-depth=value, --require-start=value, --require-end=value, --reject-start=value, --reject-end=value, --ignore-gfp-wait=value - See Documentation/fault-injection/fault-injection.txt for more + See Documentation/fault-injection/fault-injection.rst for more information failslab options: -- cgit v1.2.3 From ab42b818954c040fa13639dc031d8541edcecb4b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 12 Jun 2019 14:52:45 -0300 Subject: docs: fb: convert docs to ReST and rename to *.rst The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Also, removed the Maintained by, as requested by Geert. 
Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/fb/api.rst | 307 ++++++++++++++++ Documentation/fb/api.txt | 306 ---------------- Documentation/fb/arkfb.rst | 68 ++++ Documentation/fb/arkfb.txt | 68 ---- Documentation/fb/aty128fb.rst | 75 ++++ Documentation/fb/aty128fb.txt | 72 ---- Documentation/fb/cirrusfb.rst | 94 +++++ Documentation/fb/cirrusfb.txt | 97 ------ Documentation/fb/cmap_xfbdev.rst | 56 +++ Documentation/fb/cmap_xfbdev.txt | 53 --- Documentation/fb/deferred_io.rst | 79 +++++ Documentation/fb/deferred_io.txt | 75 ---- Documentation/fb/efifb.rst | 39 +++ Documentation/fb/efifb.txt | 37 -- Documentation/fb/ep93xx-fb.rst | 140 ++++++++ Documentation/fb/ep93xx-fb.txt | 135 -------- Documentation/fb/fbcon.rst | 350 +++++++++++++++++++ Documentation/fb/fbcon.txt | 347 ------------------- Documentation/fb/framebuffer.rst | 353 +++++++++++++++++++ Documentation/fb/framebuffer.txt | 343 ------------------ Documentation/fb/gxfb.rst | 54 +++ Documentation/fb/gxfb.txt | 52 --- Documentation/fb/index.rst | 50 +++ Documentation/fb/intel810.rst | 287 +++++++++++++++ Documentation/fb/intel810.txt | 278 --------------- Documentation/fb/intelfb.rst | 155 +++++++++ Documentation/fb/intelfb.txt | 149 -------- Documentation/fb/internals.rst | 86 +++++ Documentation/fb/internals.txt | 82 ----- Documentation/fb/lxfb.rst | 55 +++ Documentation/fb/lxfb.txt | 52 --- Documentation/fb/matroxfb.rst | 443 ++++++++++++++++++++++++ Documentation/fb/matroxfb.txt | 413 ---------------------- Documentation/fb/metronomefb.rst | 38 ++ Documentation/fb/metronomefb.txt | 36 -- Documentation/fb/modedb.rst | 155 +++++++++ Documentation/fb/modedb.txt | 151 -------- Documentation/fb/pvr2fb.rst | 66 ++++ Documentation/fb/pvr2fb.txt | 65 ---- Documentation/fb/pxafb.rst | 173 +++++++++ Documentation/fb/pxafb.txt | 142 -------- Documentation/fb/s3fb.rst | 82 +++++ Documentation/fb/s3fb.txt | 82 ----- 
Documentation/fb/sa1100fb.rst | 40 +++ Documentation/fb/sa1100fb.txt | 39 --- Documentation/fb/sh7760fb.rst | 130 +++++++ Documentation/fb/sh7760fb.txt | 131 ------- Documentation/fb/sisfb.rst | 160 +++++++++ Documentation/fb/sisfb.txt | 158 --------- Documentation/fb/sm501.rst | 15 + Documentation/fb/sm501.txt | 10 - Documentation/fb/sm712fb.rst | 35 ++ Documentation/fb/sm712fb.txt | 31 -- Documentation/fb/sstfb.rst | 207 +++++++++++ Documentation/fb/sstfb.txt | 174 ---------- Documentation/fb/tgafb.rst | 71 ++++ Documentation/fb/tgafb.txt | 69 ---- Documentation/fb/tridentfb.rst | 78 +++++ Documentation/fb/tridentfb.txt | 70 ---- Documentation/fb/udlfb.rst | 162 +++++++++ Documentation/fb/udlfb.txt | 159 --------- Documentation/fb/uvesafb.rst | 188 ++++++++++ Documentation/fb/uvesafb.txt | 184 ---------- Documentation/fb/vesafb.rst | 192 ++++++++++ Documentation/fb/vesafb.txt | 181 ---------- Documentation/fb/viafb.rst | 297 ++++++++++++++++ Documentation/fb/viafb.txt | 252 -------------- Documentation/fb/vt8623fb.rst | 64 ++++ Documentation/fb/vt8623fb.txt | 64 ---- MAINTAINERS | 10 +- drivers/tty/Kconfig | 2 +- drivers/video/fbdev/Kconfig | 24 +- drivers/video/fbdev/matrox/matroxfb_base.c | 2 +- drivers/video/fbdev/pxafb.c | 2 +- drivers/video/fbdev/sh7760fb.c | 2 +- 76 files changed, 4866 insertions(+), 4579 deletions(-) create mode 100644 Documentation/fb/api.rst delete mode 100644 Documentation/fb/api.txt create mode 100644 Documentation/fb/arkfb.rst delete mode 100644 Documentation/fb/arkfb.txt create mode 100644 Documentation/fb/aty128fb.rst delete mode 100644 Documentation/fb/aty128fb.txt create mode 100644 Documentation/fb/cirrusfb.rst delete mode 100644 Documentation/fb/cirrusfb.txt create mode 100644 Documentation/fb/cmap_xfbdev.rst delete mode 100644 Documentation/fb/cmap_xfbdev.txt create mode 100644 Documentation/fb/deferred_io.rst delete mode 100644 Documentation/fb/deferred_io.txt create mode 100644 Documentation/fb/efifb.rst delete mode 100644 
Documentation/fb/efifb.txt create mode 100644 Documentation/fb/ep93xx-fb.rst delete mode 100644 Documentation/fb/ep93xx-fb.txt create mode 100644 Documentation/fb/fbcon.rst delete mode 100644 Documentation/fb/fbcon.txt create mode 100644 Documentation/fb/framebuffer.rst delete mode 100644 Documentation/fb/framebuffer.txt create mode 100644 Documentation/fb/gxfb.rst delete mode 100644 Documentation/fb/gxfb.txt create mode 100644 Documentation/fb/index.rst create mode 100644 Documentation/fb/intel810.rst delete mode 100644 Documentation/fb/intel810.txt create mode 100644 Documentation/fb/intelfb.rst delete mode 100644 Documentation/fb/intelfb.txt create mode 100644 Documentation/fb/internals.rst delete mode 100644 Documentation/fb/internals.txt create mode 100644 Documentation/fb/lxfb.rst delete mode 100644 Documentation/fb/lxfb.txt create mode 100644 Documentation/fb/matroxfb.rst delete mode 100644 Documentation/fb/matroxfb.txt create mode 100644 Documentation/fb/metronomefb.rst delete mode 100644 Documentation/fb/metronomefb.txt create mode 100644 Documentation/fb/modedb.rst delete mode 100644 Documentation/fb/modedb.txt create mode 100644 Documentation/fb/pvr2fb.rst delete mode 100644 Documentation/fb/pvr2fb.txt create mode 100644 Documentation/fb/pxafb.rst delete mode 100644 Documentation/fb/pxafb.txt create mode 100644 Documentation/fb/s3fb.rst delete mode 100644 Documentation/fb/s3fb.txt create mode 100644 Documentation/fb/sa1100fb.rst delete mode 100644 Documentation/fb/sa1100fb.txt create mode 100644 Documentation/fb/sh7760fb.rst delete mode 100644 Documentation/fb/sh7760fb.txt create mode 100644 Documentation/fb/sisfb.rst delete mode 100644 Documentation/fb/sisfb.txt create mode 100644 Documentation/fb/sm501.rst delete mode 100644 Documentation/fb/sm501.txt create mode 100644 Documentation/fb/sm712fb.rst delete mode 100644 Documentation/fb/sm712fb.txt create mode 100644 Documentation/fb/sstfb.rst delete mode 100644 Documentation/fb/sstfb.txt create mode 
100644 Documentation/fb/tgafb.rst delete mode 100644 Documentation/fb/tgafb.txt create mode 100644 Documentation/fb/tridentfb.rst delete mode 100644 Documentation/fb/tridentfb.txt create mode 100644 Documentation/fb/udlfb.rst delete mode 100644 Documentation/fb/udlfb.txt create mode 100644 Documentation/fb/uvesafb.rst delete mode 100644 Documentation/fb/uvesafb.txt create mode 100644 Documentation/fb/vesafb.rst delete mode 100644 Documentation/fb/vesafb.txt create mode 100644 Documentation/fb/viafb.rst delete mode 100644 Documentation/fb/viafb.txt create mode 100644 Documentation/fb/vt8623fb.rst delete mode 100644 Documentation/fb/vt8623fb.txt (limited to 'drivers') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9b16b640ce48..83d6560f10f0 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5024,7 +5024,7 @@ vector=percpu: enable percpu vector domain video= [FB] Frame buffer configuration - See Documentation/fb/modedb.txt. + See Documentation/fb/modedb.rst. video.brightness_switch_enabled= [0,1] If set to 1, on receiving an ACPI notify event diff --git a/Documentation/fb/api.rst b/Documentation/fb/api.rst new file mode 100644 index 000000000000..79ec33dded74 --- /dev/null +++ b/Documentation/fb/api.rst @@ -0,0 +1,307 @@ +=========================== +The Frame Buffer Device API +=========================== + +Last revised: June 21, 2011 + + +0. Introduction +--------------- + +This document describes the frame buffer API used by applications to interact +with frame buffer devices. In-kernel APIs between device drivers and the frame +buffer core are not described. + +Due to a lack of documentation in the original frame buffer API, drivers +behaviours differ in subtle (and not so subtle) ways. This document describes +the recommended API implementation, but applications should be prepared to +deal with different behaviours. + + +1. 
Capabilities +--------------- + +Device and driver capabilities are reported in the fixed screen information +capabilities field:: + + struct fb_fix_screeninfo { + ... + __u16 capabilities; /* see FB_CAP_* */ + ... + }; + +Application should use those capabilities to find out what features they can +expect from the device and driver. + +- FB_CAP_FOURCC + +The driver supports the four character code (FOURCC) based format setting API. +When supported, formats are configured using a FOURCC instead of manually +specifying color components layout. + + +2. Types and visuals +-------------------- + +Pixels are stored in memory in hardware-dependent formats. Applications need +to be aware of the pixel storage format in order to write image data to the +frame buffer memory in the format expected by the hardware. + +Formats are described by frame buffer types and visuals. Some visuals require +additional information, which are stored in the variable screen information +bits_per_pixel, grayscale, red, green, blue and transp fields. + +Visuals describe how color information is encoded and assembled to create +macropixels. Types describe how macropixels are stored in memory. The following +types and visuals are supported. + +- FB_TYPE_PACKED_PIXELS + +Macropixels are stored contiguously in a single plane. If the number of bits +per macropixel is not a multiple of 8, whether macropixels are padded to the +next multiple of 8 bits or packed together into bytes depends on the visual. + +Padding at end of lines may be present and is then reported through the fixed +screen information line_length field. + +- FB_TYPE_PLANES + +Macropixels are split across multiple planes. The number of planes is equal to +the number of bits per macropixel, with plane i'th storing i'th bit from all +macropixels. + +Planes are located contiguously in memory. + +- FB_TYPE_INTERLEAVED_PLANES + +Macropixels are split across multiple planes. 
The number of planes is equal to +the number of bits per macropixel, with plane i'th storing i'th bit from all +macropixels. + +Planes are interleaved in memory. The interleave factor, defined as the +distance in bytes between the beginning of two consecutive interleaved blocks +belonging to different planes, is stored in the fixed screen information +type_aux field. + +- FB_TYPE_FOURCC + +Macropixels are stored in memory as described by the format FOURCC identifier +stored in the variable screen information grayscale field. + +- FB_VISUAL_MONO01 + +Pixels are black or white and stored on a number of bits (typically one) +specified by the variable screen information bpp field. + +Black pixels are represented by all bits set to 1 and white pixels by all bits +set to 0. When the number of bits per pixel is smaller than 8, several pixels +are packed together in a byte. + +FB_VISUAL_MONO01 is currently used with FB_TYPE_PACKED_PIXELS only. + +- FB_VISUAL_MONO10 + +Pixels are black or white and stored on a number of bits (typically one) +specified by the variable screen information bpp field. + +Black pixels are represented by all bits set to 0 and white pixels by all bits +set to 1. When the number of bits per pixel is smaller than 8, several pixels +are packed together in a byte. + +FB_VISUAL_MONO01 is currently used with FB_TYPE_PACKED_PIXELS only. + +- FB_VISUAL_TRUECOLOR + +Pixels are broken into red, green and blue components, and each component +indexes a read-only lookup table for the corresponding value. Lookup tables +are device-dependent, and provide linear or non-linear ramps. + +Each component is stored in a macropixel according to the variable screen +information red, green, blue and transp fields. + +- FB_VISUAL_PSEUDOCOLOR and FB_VISUAL_STATIC_PSEUDOCOLOR + +Pixel values are encoded as indices into a colormap that stores red, green and +blue components. The colormap is read-only for FB_VISUAL_STATIC_PSEUDOCOLOR +and read-write for FB_VISUAL_PSEUDOCOLOR. 
+ +Each pixel value is stored in the number of bits reported by the variable +screen information bits_per_pixel field. + +- FB_VISUAL_DIRECTCOLOR + +Pixels are broken into red, green and blue components, and each component +indexes a programmable lookup table for the corresponding value. + +Each component is stored in a macropixel according to the variable screen +information red, green, blue and transp fields. + +- FB_VISUAL_FOURCC + +Pixels are encoded and interpreted as described by the format FOURCC +identifier stored in the variable screen information grayscale field. + + +3. Screen information +--------------------- + +Screen information are queried by applications using the FBIOGET_FSCREENINFO +and FBIOGET_VSCREENINFO ioctls. Those ioctls take a pointer to a +fb_fix_screeninfo and fb_var_screeninfo structure respectively. + +struct fb_fix_screeninfo stores device independent unchangeable information +about the frame buffer device and the current format. Those information can't +be directly modified by applications, but can be changed by the driver when an +application modifies the format:: + + struct fb_fix_screeninfo { + char id[16]; /* identification string eg "TT Builtin" */ + unsigned long smem_start; /* Start of frame buffer mem */ + /* (physical address) */ + __u32 smem_len; /* Length of frame buffer mem */ + __u32 type; /* see FB_TYPE_* */ + __u32 type_aux; /* Interleave for interleaved Planes */ + __u32 visual; /* see FB_VISUAL_* */ + __u16 xpanstep; /* zero if no hardware panning */ + __u16 ypanstep; /* zero if no hardware panning */ + __u16 ywrapstep; /* zero if no hardware ywrap */ + __u32 line_length; /* length of a line in bytes */ + unsigned long mmio_start; /* Start of Memory Mapped I/O */ + /* (physical address) */ + __u32 mmio_len; /* Length of Memory Mapped I/O */ + __u32 accel; /* Indicate to driver which */ + /* specific chip/card we have */ + __u16 capabilities; /* see FB_CAP_* */ + __u16 reserved[2]; /* Reserved for future compatibility 
*/ + }; + +struct fb_var_screeninfo stores device independent changeable information +about a frame buffer device, its current format and video mode, as well as +other miscellaneous parameters:: + + struct fb_var_screeninfo { + __u32 xres; /* visible resolution */ + __u32 yres; + __u32 xres_virtual; /* virtual resolution */ + __u32 yres_virtual; + __u32 xoffset; /* offset from virtual to visible */ + __u32 yoffset; /* resolution */ + + __u32 bits_per_pixel; /* guess what */ + __u32 grayscale; /* 0 = color, 1 = grayscale, */ + /* >1 = FOURCC */ + struct fb_bitfield red; /* bitfield in fb mem if true color, */ + struct fb_bitfield green; /* else only length is significant */ + struct fb_bitfield blue; + struct fb_bitfield transp; /* transparency */ + + __u32 nonstd; /* != 0 Non standard pixel format */ + + __u32 activate; /* see FB_ACTIVATE_* */ + + __u32 height; /* height of picture in mm */ + __u32 width; /* width of picture in mm */ + + __u32 accel_flags; /* (OBSOLETE) see fb_info.flags */ + + /* Timing: All values in pixclocks, except pixclock (of course) */ + __u32 pixclock; /* pixel clock in ps (pico seconds) */ + __u32 left_margin; /* time from sync to picture */ + __u32 right_margin; /* time from picture to sync */ + __u32 upper_margin; /* time from sync to picture */ + __u32 lower_margin; + __u32 hsync_len; /* length of horizontal sync */ + __u32 vsync_len; /* length of vertical sync */ + __u32 sync; /* see FB_SYNC_* */ + __u32 vmode; /* see FB_VMODE_* */ + __u32 rotate; /* angle we rotate counter clockwise */ + __u32 colorspace; /* colorspace for FOURCC-based modes */ + __u32 reserved[4]; /* Reserved for future compatibility */ + }; + +To modify variable information, applications call the FBIOPUT_VSCREENINFO +ioctl with a pointer to a fb_var_screeninfo structure. If the call is +successful, the driver will update the fixed screen information accordingly. 
+ +Instead of filling the complete fb_var_screeninfo structure manually, +applications should call the FBIOGET_VSCREENINFO ioctl and modify only the +fields they care about. + + +4. Format configuration +----------------------- + +Frame buffer devices offer two ways to configure the frame buffer format: the +legacy API and the FOURCC-based API. + + +The legacy API has been the only frame buffer format configuration API for a +long time and is thus widely used by application. It is the recommended API +for applications when using RGB and grayscale formats, as well as legacy +non-standard formats. + +To select a format, applications set the fb_var_screeninfo bits_per_pixel field +to the desired frame buffer depth. Values up to 8 will usually map to +monochrome, grayscale or pseudocolor visuals, although this is not required. + +- For grayscale formats, applications set the grayscale field to one. The red, + blue, green and transp fields must be set to 0 by applications and ignored by + drivers. Drivers must fill the red, blue and green offsets to 0 and lengths + to the bits_per_pixel value. + +- For pseudocolor formats, applications set the grayscale field to zero. The + red, blue, green and transp fields must be set to 0 by applications and + ignored by drivers. Drivers must fill the red, blue and green offsets to 0 + and lengths to the bits_per_pixel value. + +- For truecolor and directcolor formats, applications set the grayscale field + to zero, and the red, blue, green and transp fields to describe the layout of + color components in memory:: + + struct fb_bitfield { + __u32 offset; /* beginning of bitfield */ + __u32 length; /* length of bitfield */ + __u32 msb_right; /* != 0 : Most significant bit is */ + /* right */ + }; + + Pixel values are bits_per_pixel wide and are split in non-overlapping red, + green, blue and alpha (transparency) components. 
Location and size of each + component in the pixel value are described by the fb_bitfield offset and + length fields. Offset are computed from the right. + + Pixels are always stored in an integer number of bytes. If the number of + bits per pixel is not a multiple of 8, pixel values are padded to the next + multiple of 8 bits. + +Upon successful format configuration, drivers update the fb_fix_screeninfo +type, visual and line_length fields depending on the selected format. + + +The FOURCC-based API replaces format descriptions by four character codes +(FOURCC). FOURCCs are abstract identifiers that uniquely define a format +without explicitly describing it. This is the only API that supports YUV +formats. Drivers are also encouraged to implement the FOURCC-based API for RGB +and grayscale formats. + +Drivers that support the FOURCC-based API report this capability by setting +the FB_CAP_FOURCC bit in the fb_fix_screeninfo capabilities field. + +FOURCC definitions are located in the linux/videodev2.h header. However, and +despite starting with the V4L2_PIX_FMT_prefix, they are not restricted to V4L2 +and don't require usage of the V4L2 subsystem. FOURCC documentation is +available in Documentation/media/uapi/v4l/pixfmt.rst. + +To select a format, applications set the grayscale field to the desired FOURCC. +For YUV formats, they should also select the appropriate colorspace by setting +the colorspace field to one of the colorspaces listed in linux/videodev2.h and +documented in Documentation/media/uapi/v4l/colorspaces.rst. + +The red, green, blue and transp fields are not used with the FOURCC-based API. +For forward compatibility reasons applications must zero those fields, and +drivers must ignore them. Values other than 0 may get a meaning in future +extensions. + +Upon successful format configuration, drivers update the fb_fix_screeninfo +type, visual and line_length fields depending on the selected format. 
The type +and visual fields are set to FB_TYPE_FOURCC and FB_VISUAL_FOURCC respectively. diff --git a/Documentation/fb/api.txt b/Documentation/fb/api.txt deleted file mode 100644 index d52cf1e3b975..000000000000 --- a/Documentation/fb/api.txt +++ /dev/null @@ -1,306 +0,0 @@ - The Frame Buffer Device API - --------------------------- - -Last revised: June 21, 2011 - - -0. Introduction ---------------- - -This document describes the frame buffer API used by applications to interact -with frame buffer devices. In-kernel APIs between device drivers and the frame -buffer core are not described. - -Due to a lack of documentation in the original frame buffer API, drivers -behaviours differ in subtle (and not so subtle) ways. This document describes -the recommended API implementation, but applications should be prepared to -deal with different behaviours. - - -1. Capabilities ---------------- - -Device and driver capabilities are reported in the fixed screen information -capabilities field. - -struct fb_fix_screeninfo { - ... - __u16 capabilities; /* see FB_CAP_* */ - ... -}; - -Application should use those capabilities to find out what features they can -expect from the device and driver. - -- FB_CAP_FOURCC - -The driver supports the four character code (FOURCC) based format setting API. -When supported, formats are configured using a FOURCC instead of manually -specifying color components layout. - - -2. Types and visuals --------------------- - -Pixels are stored in memory in hardware-dependent formats. Applications need -to be aware of the pixel storage format in order to write image data to the -frame buffer memory in the format expected by the hardware. - -Formats are described by frame buffer types and visuals. Some visuals require -additional information, which are stored in the variable screen information -bits_per_pixel, grayscale, red, green, blue and transp fields. - -Visuals describe how color information is encoded and assembled to create -macropixels. 
Types describe how macropixels are stored in memory. The following -types and visuals are supported. - -- FB_TYPE_PACKED_PIXELS - -Macropixels are stored contiguously in a single plane. If the number of bits -per macropixel is not a multiple of 8, whether macropixels are padded to the -next multiple of 8 bits or packed together into bytes depends on the visual. - -Padding at end of lines may be present and is then reported through the fixed -screen information line_length field. - -- FB_TYPE_PLANES - -Macropixels are split across multiple planes. The number of planes is equal to -the number of bits per macropixel, with plane i'th storing i'th bit from all -macropixels. - -Planes are located contiguously in memory. - -- FB_TYPE_INTERLEAVED_PLANES - -Macropixels are split across multiple planes. The number of planes is equal to -the number of bits per macropixel, with plane i'th storing i'th bit from all -macropixels. - -Planes are interleaved in memory. The interleave factor, defined as the -distance in bytes between the beginning of two consecutive interleaved blocks -belonging to different planes, is stored in the fixed screen information -type_aux field. - -- FB_TYPE_FOURCC - -Macropixels are stored in memory as described by the format FOURCC identifier -stored in the variable screen information grayscale field. - -- FB_VISUAL_MONO01 - -Pixels are black or white and stored on a number of bits (typically one) -specified by the variable screen information bpp field. - -Black pixels are represented by all bits set to 1 and white pixels by all bits -set to 0. When the number of bits per pixel is smaller than 8, several pixels -are packed together in a byte. - -FB_VISUAL_MONO01 is currently used with FB_TYPE_PACKED_PIXELS only. - -- FB_VISUAL_MONO10 - -Pixels are black or white and stored on a number of bits (typically one) -specified by the variable screen information bpp field. 
- -Black pixels are represented by all bits set to 0 and white pixels by all bits -set to 1. When the number of bits per pixel is smaller than 8, several pixels -are packed together in a byte. - -FB_VISUAL_MONO01 is currently used with FB_TYPE_PACKED_PIXELS only. - -- FB_VISUAL_TRUECOLOR - -Pixels are broken into red, green and blue components, and each component -indexes a read-only lookup table for the corresponding value. Lookup tables -are device-dependent, and provide linear or non-linear ramps. - -Each component is stored in a macropixel according to the variable screen -information red, green, blue and transp fields. - -- FB_VISUAL_PSEUDOCOLOR and FB_VISUAL_STATIC_PSEUDOCOLOR - -Pixel values are encoded as indices into a colormap that stores red, green and -blue components. The colormap is read-only for FB_VISUAL_STATIC_PSEUDOCOLOR -and read-write for FB_VISUAL_PSEUDOCOLOR. - -Each pixel value is stored in the number of bits reported by the variable -screen information bits_per_pixel field. - -- FB_VISUAL_DIRECTCOLOR - -Pixels are broken into red, green and blue components, and each component -indexes a programmable lookup table for the corresponding value. - -Each component is stored in a macropixel according to the variable screen -information red, green, blue and transp fields. - -- FB_VISUAL_FOURCC - -Pixels are encoded and interpreted as described by the format FOURCC -identifier stored in the variable screen information grayscale field. - - -3. Screen information ---------------------- - -Screen information are queried by applications using the FBIOGET_FSCREENINFO -and FBIOGET_VSCREENINFO ioctls. Those ioctls take a pointer to a -fb_fix_screeninfo and fb_var_screeninfo structure respectively. - -struct fb_fix_screeninfo stores device independent unchangeable information -about the frame buffer device and the current format. 
Those information can't -be directly modified by applications, but can be changed by the driver when an -application modifies the format. - -struct fb_fix_screeninfo { - char id[16]; /* identification string eg "TT Builtin" */ - unsigned long smem_start; /* Start of frame buffer mem */ - /* (physical address) */ - __u32 smem_len; /* Length of frame buffer mem */ - __u32 type; /* see FB_TYPE_* */ - __u32 type_aux; /* Interleave for interleaved Planes */ - __u32 visual; /* see FB_VISUAL_* */ - __u16 xpanstep; /* zero if no hardware panning */ - __u16 ypanstep; /* zero if no hardware panning */ - __u16 ywrapstep; /* zero if no hardware ywrap */ - __u32 line_length; /* length of a line in bytes */ - unsigned long mmio_start; /* Start of Memory Mapped I/O */ - /* (physical address) */ - __u32 mmio_len; /* Length of Memory Mapped I/O */ - __u32 accel; /* Indicate to driver which */ - /* specific chip/card we have */ - __u16 capabilities; /* see FB_CAP_* */ - __u16 reserved[2]; /* Reserved for future compatibility */ -}; - -struct fb_var_screeninfo stores device independent changeable information -about a frame buffer device, its current format and video mode, as well as -other miscellaneous parameters. 
- -struct fb_var_screeninfo { - __u32 xres; /* visible resolution */ - __u32 yres; - __u32 xres_virtual; /* virtual resolution */ - __u32 yres_virtual; - __u32 xoffset; /* offset from virtual to visible */ - __u32 yoffset; /* resolution */ - - __u32 bits_per_pixel; /* guess what */ - __u32 grayscale; /* 0 = color, 1 = grayscale, */ - /* >1 = FOURCC */ - struct fb_bitfield red; /* bitfield in fb mem if true color, */ - struct fb_bitfield green; /* else only length is significant */ - struct fb_bitfield blue; - struct fb_bitfield transp; /* transparency */ - - __u32 nonstd; /* != 0 Non standard pixel format */ - - __u32 activate; /* see FB_ACTIVATE_* */ - - __u32 height; /* height of picture in mm */ - __u32 width; /* width of picture in mm */ - - __u32 accel_flags; /* (OBSOLETE) see fb_info.flags */ - - /* Timing: All values in pixclocks, except pixclock (of course) */ - __u32 pixclock; /* pixel clock in ps (pico seconds) */ - __u32 left_margin; /* time from sync to picture */ - __u32 right_margin; /* time from picture to sync */ - __u32 upper_margin; /* time from sync to picture */ - __u32 lower_margin; - __u32 hsync_len; /* length of horizontal sync */ - __u32 vsync_len; /* length of vertical sync */ - __u32 sync; /* see FB_SYNC_* */ - __u32 vmode; /* see FB_VMODE_* */ - __u32 rotate; /* angle we rotate counter clockwise */ - __u32 colorspace; /* colorspace for FOURCC-based modes */ - __u32 reserved[4]; /* Reserved for future compatibility */ -}; - -To modify variable information, applications call the FBIOPUT_VSCREENINFO -ioctl with a pointer to a fb_var_screeninfo structure. If the call is -successful, the driver will update the fixed screen information accordingly. - -Instead of filling the complete fb_var_screeninfo structure manually, -applications should call the FBIOGET_VSCREENINFO ioctl and modify only the -fields they care about. - - -4. 
Format configuration ------------------------ - -Frame buffer devices offer two ways to configure the frame buffer format: the -legacy API and the FOURCC-based API. - - -The legacy API has been the only frame buffer format configuration API for a -long time and is thus widely used by application. It is the recommended API -for applications when using RGB and grayscale formats, as well as legacy -non-standard formats. - -To select a format, applications set the fb_var_screeninfo bits_per_pixel field -to the desired frame buffer depth. Values up to 8 will usually map to -monochrome, grayscale or pseudocolor visuals, although this is not required. - -- For grayscale formats, applications set the grayscale field to one. The red, - blue, green and transp fields must be set to 0 by applications and ignored by - drivers. Drivers must fill the red, blue and green offsets to 0 and lengths - to the bits_per_pixel value. - -- For pseudocolor formats, applications set the grayscale field to zero. The - red, blue, green and transp fields must be set to 0 by applications and - ignored by drivers. Drivers must fill the red, blue and green offsets to 0 - and lengths to the bits_per_pixel value. - -- For truecolor and directcolor formats, applications set the grayscale field - to zero, and the red, blue, green and transp fields to describe the layout of - color components in memory. - -struct fb_bitfield { - __u32 offset; /* beginning of bitfield */ - __u32 length; /* length of bitfield */ - __u32 msb_right; /* != 0 : Most significant bit is */ - /* right */ -}; - - Pixel values are bits_per_pixel wide and are split in non-overlapping red, - green, blue and alpha (transparency) components. Location and size of each - component in the pixel value are described by the fb_bitfield offset and - length fields. Offset are computed from the right. - - Pixels are always stored in an integer number of bytes. 
If the number of - bits per pixel is not a multiple of 8, pixel values are padded to the next - multiple of 8 bits. - -Upon successful format configuration, drivers update the fb_fix_screeninfo -type, visual and line_length fields depending on the selected format. - - -The FOURCC-based API replaces format descriptions by four character codes -(FOURCC). FOURCCs are abstract identifiers that uniquely define a format -without explicitly describing it. This is the only API that supports YUV -formats. Drivers are also encouraged to implement the FOURCC-based API for RGB -and grayscale formats. - -Drivers that support the FOURCC-based API report this capability by setting -the FB_CAP_FOURCC bit in the fb_fix_screeninfo capabilities field. - -FOURCC definitions are located in the linux/videodev2.h header. However, and -despite starting with the V4L2_PIX_FMT_prefix, they are not restricted to V4L2 -and don't require usage of the V4L2 subsystem. FOURCC documentation is -available in Documentation/media/uapi/v4l/pixfmt.rst. - -To select a format, applications set the grayscale field to the desired FOURCC. -For YUV formats, they should also select the appropriate colorspace by setting -the colorspace field to one of the colorspaces listed in linux/videodev2.h and -documented in Documentation/media/uapi/v4l/colorspaces.rst. - -The red, green, blue and transp fields are not used with the FOURCC-based API. -For forward compatibility reasons applications must zero those fields, and -drivers must ignore them. Values other than 0 may get a meaning in future -extensions. - -Upon successful format configuration, drivers update the fb_fix_screeninfo -type, visual and line_length fields depending on the selected format. The type -and visual fields are set to FB_TYPE_FOURCC and FB_VISUAL_FOURCC respectively. 
diff --git a/Documentation/fb/arkfb.rst b/Documentation/fb/arkfb.rst new file mode 100644 index 000000000000..aeca8773dd7e --- /dev/null +++ b/Documentation/fb/arkfb.rst @@ -0,0 +1,68 @@ +======================================== +arkfb - fbdev driver for ARK Logic chips +======================================== + + +Supported Hardware +================== + + ARK 2000PV chip + ICS 5342 ramdac + + - only BIOS initialized VGA devices supported + - probably not working on big endian + + +Supported Features +================== + + * 4 bpp pseudocolor modes (with 18bit palette, two variants) + * 8 bpp pseudocolor mode (with 18bit palette) + * 16 bpp truecolor modes (RGB 555 and RGB 565) + * 24 bpp truecolor mode (RGB 888) + * 32 bpp truecolor mode (RGB 888) + * text mode (activated by bpp = 0) + * doublescan mode variant (not available in text mode) + * panning in both directions + * suspend/resume support + +Text mode is supported even in higher resolutions, but there is limitation to +lower pixclocks (i got maximum about 70 MHz, it is dependent on specific +hardware). This limitation is not enforced by driver. Text mode supports 8bit +wide fonts only (hardware limitation) and 16bit tall fonts (driver +limitation). Unfortunately character attributes (like color) in text mode are +broken for unknown reason, so its usefulness is limited. + +There are two 4 bpp modes. First mode (selected if nonstd == 0) is mode with +packed pixels, high nibble first. Second mode (selected if nonstd == 1) is mode +with interleaved planes (1 byte interleave), MSB first. Both modes support +8bit wide fonts only (driver limitation). + +Suspend/resume works on systems that initialize video card during resume and +if device is active (for example used by fbcon). 
+ + +Missing Features +================ +(alias TODO list) + + * secondary (not initialized by BIOS) device support + * big endian support + * DPMS support + * MMIO support + * interlaced mode variant + * support for fontwidths != 8 in 4 bpp modes + * support for fontheight != 16 in text mode + * hardware cursor + * vsync synchronization + * feature connector support + * acceleration support (8514-like 2D) + + +Known bugs +========== + + * character attributes (and cursor) in text mode are broken + +-- +Ondrej Zajicek diff --git a/Documentation/fb/arkfb.txt b/Documentation/fb/arkfb.txt deleted file mode 100644 index e8487a9d6a05..000000000000 --- a/Documentation/fb/arkfb.txt +++ /dev/null @@ -1,68 +0,0 @@ - - arkfb - fbdev driver for ARK Logic chips - ======================================== - - -Supported Hardware -================== - - ARK 2000PV chip - ICS 5342 ramdac - - - only BIOS initialized VGA devices supported - - probably not working on big endian - - -Supported Features -================== - - * 4 bpp pseudocolor modes (with 18bit palette, two variants) - * 8 bpp pseudocolor mode (with 18bit palette) - * 16 bpp truecolor modes (RGB 555 and RGB 565) - * 24 bpp truecolor mode (RGB 888) - * 32 bpp truecolor mode (RGB 888) - * text mode (activated by bpp = 0) - * doublescan mode variant (not available in text mode) - * panning in both directions - * suspend/resume support - -Text mode is supported even in higher resolutions, but there is limitation to -lower pixclocks (i got maximum about 70 MHz, it is dependent on specific -hardware). This limitation is not enforced by driver. Text mode supports 8bit -wide fonts only (hardware limitation) and 16bit tall fonts (driver -limitation). Unfortunately character attributes (like color) in text mode are -broken for unknown reason, so its usefulness is limited. - -There are two 4 bpp modes. First mode (selected if nonstd == 0) is mode with -packed pixels, high nibble first. 
Second mode (selected if nonstd == 1) is mode -with interleaved planes (1 byte interleave), MSB first. Both modes support -8bit wide fonts only (driver limitation). - -Suspend/resume works on systems that initialize video card during resume and -if device is active (for example used by fbcon). - - -Missing Features -================ -(alias TODO list) - - * secondary (not initialized by BIOS) device support - * big endian support - * DPMS support - * MMIO support - * interlaced mode variant - * support for fontwidths != 8 in 4 bpp modes - * support for fontheight != 16 in text mode - * hardware cursor - * vsync synchronization - * feature connector support - * acceleration support (8514-like 2D) - - -Known bugs -========== - - * character attributes (and cursor) in text mode are broken - --- -Ondrej Zajicek diff --git a/Documentation/fb/aty128fb.rst b/Documentation/fb/aty128fb.rst new file mode 100644 index 000000000000..3f107718f933 --- /dev/null +++ b/Documentation/fb/aty128fb.rst @@ -0,0 +1,75 @@ +================= +What is aty128fb? +================= + +.. [This file is cloned from VesaFB/matroxfb] + +This is a driver for a graphic framebuffer for ATI Rage128 based devices +on Intel and PPC boxes. + +Advantages: + + * It provides a nice large console (128 cols + 48 lines with 1024x768) + without using tiny, unreadable fonts. + * You can run XF68_FBDev on top of /dev/fb0 + * Most important: boot logo :-) + +Disadvantages: + + * graphic mode is slower than text mode... but you should not notice + if you use same resolution as you used in textmode. + * still experimental. + + +How to use it? +============== + +Switching modes is done using the video=aty128fb:... modedb +boot parameter or using `fbset` program. + +See Documentation/fb/modedb.rst for more information on modedb +resolutions. + +You should compile in both vgacon (to boot if you remove your Rage128 from +box) and aty128fb (for graphics mode). 
You should not compile-in vesafb +unless you have primary display on non-Rage128 VBE2.0 device (see +Documentation/fb/vesafb.rst for details). + + +X11 +=== + +XF68_FBDev should generally work fine, but it is non-accelerated. As of +this document, 8 and 32bpp works fine. There have been palette issues +when switching from X to console and back to X. You will have to restart +X to fix this. + + +Configuration +============= + +You can pass kernel command line options to vesafb with +`video=aty128fb:option1,option2:value2,option3` (multiple options should +be separated by comma, values are separated from options by `:`). +Accepted options: + +========= ======================================================= +noaccel do not use acceleration engine. It is default. +accel use acceleration engine. Not finished. +vmode:x chooses PowerMacintosh video mode . Deprecated. +cmode:x chooses PowerMacintosh colour mode . Deprecated. + selects startup videomode. See modedb.txt for detailed + explanation. Default is 640x480x8bpp. +========= ======================================================= + + +Limitations +=========== + +There are known and unknown bugs, features and misfeatures. +Currently there are following known bugs: + + - This driver is still experimental and is not finished. Too many + bugs/errata to list here. + +Brad Douglas diff --git a/Documentation/fb/aty128fb.txt b/Documentation/fb/aty128fb.txt deleted file mode 100644 index b605204fcfe1..000000000000 --- a/Documentation/fb/aty128fb.txt +++ /dev/null @@ -1,72 +0,0 @@ -[This file is cloned from VesaFB/matroxfb] - -What is aty128fb? -================= - -This is a driver for a graphic framebuffer for ATI Rage128 based devices -on Intel and PPC boxes. - -Advantages: - - * It provides a nice large console (128 cols + 48 lines with 1024x768) - without using tiny, unreadable fonts. 
- * You can run XF68_FBDev on top of /dev/fb0 - * Most important: boot logo :-) - -Disadvantages: - - * graphic mode is slower than text mode... but you should not notice - if you use same resolution as you used in textmode. - * still experimental. - - -How to use it? -============== - -Switching modes is done using the video=aty128fb:... modedb -boot parameter or using `fbset' program. - -See Documentation/fb/modedb.txt for more information on modedb -resolutions. - -You should compile in both vgacon (to boot if you remove your Rage128 from -box) and aty128fb (for graphics mode). You should not compile-in vesafb -unless you have primary display on non-Rage128 VBE2.0 device (see -Documentation/fb/vesafb.txt for details). - - -X11 -=== - -XF68_FBDev should generally work fine, but it is non-accelerated. As of -this document, 8 and 32bpp works fine. There have been palette issues -when switching from X to console and back to X. You will have to restart -X to fix this. - - -Configuration -============= - -You can pass kernel command line options to vesafb with -`video=aty128fb:option1,option2:value2,option3' (multiple options should -be separated by comma, values are separated from options by `:'). -Accepted options: - -noaccel - do not use acceleration engine. It is default. -accel - use acceleration engine. Not finished. -vmode:x - chooses PowerMacintosh video mode . Deprecated. -cmode:x - chooses PowerMacintosh colour mode . Deprecated. - - selects startup videomode. See modedb.txt for detailed - explanation. Default is 640x480x8bpp. - - -Limitations -=========== - -There are known and unknown bugs, features and misfeatures. -Currently there are following known bugs: - + This driver is still experimental and is not finished. Too many - bugs/errata to list here. 
- --- -Brad Douglas diff --git a/Documentation/fb/cirrusfb.rst b/Documentation/fb/cirrusfb.rst new file mode 100644 index 000000000000..8c3e6c6cb114 --- /dev/null +++ b/Documentation/fb/cirrusfb.rst @@ -0,0 +1,94 @@ +============================================ +Framebuffer driver for Cirrus Logic chipsets +============================================ + +Copyright 1999 Jeff Garzik + + +.. just a little something to get people going; contributors welcome! + + +Chip families supported: + - SD64 + - Piccolo + - Picasso + - Spectrum + - Alpine (GD-543x/4x) + - Picasso4 (GD-5446) + - GD-5480 + - Laguna (GD-546x) + +Bus's supported: + - PCI + - Zorro + +Architectures supported: + - i386 + - Alpha + - PPC (Motorola Powerstack) + - m68k (Amiga) + + + +Default video modes +------------------- +At the moment, there are two kernel command line arguments supported: + +- mode:640x480 +- mode:800x600 +- mode:1024x768 + +Full support for startup video modes (modedb) will be integrated soon. + +Version 1.9.9.1 +--------------- +* Fix memory detection for 512kB case +* 800x600 mode +* Fixed timings +* Hint for AXP: Use -accel false -vyres -1 when changing resolution + + +Version 1.9.4.4 +--------------- +* Preliminary Laguna support +* Overhaul color register routines. +* Associated with the above, console colors are now obtained from a LUT + called 'palette' instead of from the VGA registers. This code was + modelled after that in atyfb and matroxfb. +* Code cleanup, add comments. +* Overhaul SR07 handling. +* Bug fixes. + + +Version 1.9.4.3 +--------------- +* Correctly set default startup video mode. +* Do not override ram size setting. Define + CLGEN_USE_HARDCODED_RAM_SETTINGS if you _do_ want to override the RAM + setting. +* Compile fixes related to new 2.3.x IORESOURCE_IO[PORT] symbol changes. +* Use new 2.3.x resource allocation. +* Some code cleanup. + + +Version 1.9.4.2 +--------------- +* Casting fixes. +* Assertions no longer cause an oops on purpose. +* Bug fixes. 
+ + +Version 1.9.4.1 +--------------- +* Add compatibility support. Now requires a 2.1.x, 2.2.x or 2.3.x kernel. + + +Version 1.9.4 +------------- +* Several enhancements, smaller memory footprint, a few bugfixes. +* Requires kernel 2.3.14-pre1 or later. + + +Version 1.9.3 +------------- +* Bundled with kernel 2.3.14-pre1 or later. diff --git a/Documentation/fb/cirrusfb.txt b/Documentation/fb/cirrusfb.txt deleted file mode 100644 index f75950d330a4..000000000000 --- a/Documentation/fb/cirrusfb.txt +++ /dev/null @@ -1,97 +0,0 @@ - - Framebuffer driver for Cirrus Logic chipsets - Copyright 1999 Jeff Garzik - - - -{ just a little something to get people going; contributors welcome! } - - - -Chip families supported: - SD64 - Piccolo - Picasso - Spectrum - Alpine (GD-543x/4x) - Picasso4 (GD-5446) - GD-5480 - Laguna (GD-546x) - -Bus's supported: - PCI - Zorro - -Architectures supported: - i386 - Alpha - PPC (Motorola Powerstack) - m68k (Amiga) - - - -Default video modes -------------------- -At the moment, there are two kernel command line arguments supported: - -mode:640x480 -mode:800x600 - or -mode:1024x768 - -Full support for startup video modes (modedb) will be integrated soon. - -Version 1.9.9.1 ---------------- -* Fix memory detection for 512kB case -* 800x600 mode -* Fixed timings -* Hint for AXP: Use -accel false -vyres -1 when changing resolution - - -Version 1.9.4.4 ---------------- -* Preliminary Laguna support -* Overhaul color register routines. -* Associated with the above, console colors are now obtained from a LUT - called 'palette' instead of from the VGA registers. This code was - modelled after that in atyfb and matroxfb. -* Code cleanup, add comments. -* Overhaul SR07 handling. -* Bug fixes. - - -Version 1.9.4.3 ---------------- -* Correctly set default startup video mode. -* Do not override ram size setting. Define - CLGEN_USE_HARDCODED_RAM_SETTINGS if you _do_ want to override the RAM - setting. 
-* Compile fixes related to new 2.3.x IORESOURCE_IO[PORT] symbol changes. -* Use new 2.3.x resource allocation. -* Some code cleanup. - - -Version 1.9.4.2 ---------------- -* Casting fixes. -* Assertions no longer cause an oops on purpose. -* Bug fixes. - - -Version 1.9.4.1 ---------------- -* Add compatibility support. Now requires a 2.1.x, 2.2.x or 2.3.x kernel. - - -Version 1.9.4 -------------- -* Several enhancements, smaller memory footprint, a few bugfixes. -* Requires kernel 2.3.14-pre1 or later. - - -Version 1.9.3 -------------- -* Bundled with kernel 2.3.14-pre1 or later. - - diff --git a/Documentation/fb/cmap_xfbdev.rst b/Documentation/fb/cmap_xfbdev.rst new file mode 100644 index 000000000000..5db5e9787361 --- /dev/null +++ b/Documentation/fb/cmap_xfbdev.rst @@ -0,0 +1,56 @@ +========================== +Understanding fbdev's cmap +========================== + +These notes explain how X's dix layer uses fbdev's cmap structures. + +- example of relevant structures in fbdev as used for a 3-bit grayscale cmap:: + + struct fb_var_screeninfo { + .bits_per_pixel = 8, + .grayscale = 1, + .red = { 4, 3, 0 }, + .green = { 0, 0, 0 }, + .blue = { 0, 0, 0 }, + } + struct fb_fix_screeninfo { + .visual = FB_VISUAL_STATIC_PSEUDOCOLOR, + } + for (i = 0; i < 8; i++) + info->cmap.red[i] = (((2*i)+1)*(0xFFFF))/16; + memcpy(info->cmap.green, info->cmap.red, sizeof(u16)*8); + memcpy(info->cmap.blue, info->cmap.red, sizeof(u16)*8); + +- X11 apps do something like the following when trying to use grayscale:: + + for (i=0; i < 8; i++) { + char colorspec[64]; + memset(colorspec,0,64); + sprintf(colorspec, "rgb:%x/%x/%x", i*36,i*36,i*36); + if (!XParseColor(outputDisplay, testColormap, colorspec, &wantedColor)) + printf("Can't get color %s\n",colorspec); + XAllocColor(outputDisplay, testColormap, &wantedColor); + grays[i] = wantedColor; + } + +There's also named equivalents like gray1..x provided you have an rgb.txt. 
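As a rough illustration of the cmap initialisation quoted above, the following sketch evaluates the same expression, `(((2*i)+1)*(0xFFFF))/16`, for the eight gray levels and then picks the closest entry for a requested value. The names `fill_gray_cmap` and `nearest_gray` are hypothetical helpers introduced only for this example; `nearest_gray` is a simplified single-channel squared-distance match, not the X server's `FindBestPixel`.

```c
#include <stdint.h>

#define GRAY_LEVELS 8

/*
 * Fill cmap[] using the same expression as the fbdev snippet above:
 * each entry is the midpoint of one of the 8 equal gray ranges.
 */
void fill_gray_cmap(uint16_t cmap[GRAY_LEVELS])
{
    for (int i = 0; i < GRAY_LEVELS; i++)
        cmap[i] = (uint16_t)((((2 * i) + 1) * 0xFFFF) / 16);
}

/*
 * Hypothetical helper (not X code): return the index of the cmap entry
 * with the smallest squared distance to the wanted 16-bit gray value.
 */
int nearest_gray(const uint16_t cmap[GRAY_LEVELS], uint16_t wanted)
{
    int best = 0;
    int64_t best_sq = -1;

    for (int i = 0; i < GRAY_LEVELS; i++) {
        int64_t d = (int64_t)cmap[i] - (int64_t)wanted;
        int64_t sq = d * d;

        if (best_sq < 0 || sq < best_sq) {
            best_sq = sq;
            best = i;
        }
    }
    return best;
}
```

With these definitions the table runs from 4095 (entry 0) to 61439 (entry 7), so a request for pure black or pure white still lands on the first or last entry respectively.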
+ +Somewhere in X's callchain, this results in a call to X code that handles the +colormap. For example, Xfbdev hits the following: + +xc-011010/programs/Xserver/dix/colormap.c:: + + FindBestPixel(pentFirst, size, prgb, channel) + + dr = (long) pent->co.local.red - prgb->red; + dg = (long) pent->co.local.green - prgb->green; + db = (long) pent->co.local.blue - prgb->blue; + sq = dr * dr; + UnsignedToBigNum (sq, &sum); + BigNumAdd (&sum, &temp, &sum); + +co.local.red are entries that were brought in through FBIOGETCMAP which come +directly from the info->cmap.red that was listed above. The prgb is the rgb +that the app wants to match to. The above code is doing what looks like a least +squares matching function. That's why the cmap entries can't be set to the left +hand side boundaries of a color range. diff --git a/Documentation/fb/cmap_xfbdev.txt b/Documentation/fb/cmap_xfbdev.txt deleted file mode 100644 index 55e1f0a3d2b4..000000000000 --- a/Documentation/fb/cmap_xfbdev.txt +++ /dev/null @@ -1,53 +0,0 @@ -Understanding fbdev's cmap --------------------------- - -These notes explain how X's dix layer uses fbdev's cmap structures. - -*. example of relevant structures in fbdev as used for a 3-bit grayscale cmap -struct fb_var_screeninfo { - .bits_per_pixel = 8, - .grayscale = 1, - .red = { 4, 3, 0 }, - .green = { 0, 0, 0 }, - .blue = { 0, 0, 0 }, -} -struct fb_fix_screeninfo { - .visual = FB_VISUAL_STATIC_PSEUDOCOLOR, -} -for (i = 0; i < 8; i++) - info->cmap.red[i] = (((2*i)+1)*(0xFFFF))/16; -memcpy(info->cmap.green, info->cmap.red, sizeof(u16)*8); -memcpy(info->cmap.blue, info->cmap.red, sizeof(u16)*8); - -*. X11 apps do something like the following when trying to use grayscale. 
-for (i=0; i < 8; i++) { - char colorspec[64]; - memset(colorspec,0,64); - sprintf(colorspec, "rgb:%x/%x/%x", i*36,i*36,i*36); - if (!XParseColor(outputDisplay, testColormap, colorspec, &wantedColor)) - printf("Can't get color %s\n",colorspec); - XAllocColor(outputDisplay, testColormap, &wantedColor); - grays[i] = wantedColor; -} -There's also named equivalents like gray1..x provided you have an rgb.txt. - -Somewhere in X's callchain, this results in a call to X code that handles the -colormap. For example, Xfbdev hits the following: - -xc-011010/programs/Xserver/dix/colormap.c: - -FindBestPixel(pentFirst, size, prgb, channel) - -dr = (long) pent->co.local.red - prgb->red; -dg = (long) pent->co.local.green - prgb->green; -db = (long) pent->co.local.blue - prgb->blue; -sq = dr * dr; -UnsignedToBigNum (sq, &sum); -BigNumAdd (&sum, &temp, &sum); - -co.local.red are entries that were brought in through FBIOGETCMAP which come -directly from the info->cmap.red that was listed above. The prgb is the rgb -that the app wants to match to. The above code is doing what looks like a least -squares matching function. That's why the cmap entries can't be set to the left -hand side boundaries of a color range. - diff --git a/Documentation/fb/deferred_io.rst b/Documentation/fb/deferred_io.rst new file mode 100644 index 000000000000..7300cff255a3 --- /dev/null +++ b/Documentation/fb/deferred_io.rst @@ -0,0 +1,79 @@ +=========== +Deferred IO +=========== + +Deferred IO is a way to delay and repurpose IO. It uses host memory as a +buffer and the MMU pagefault as a pretrigger for when to perform the device +IO. 
The following example may be a useful explanation of how one such setup +works: + +- userspace app like Xfbdev mmaps framebuffer +- deferred IO and driver sets up fault and page_mkwrite handlers +- userspace app tries to write to mmaped vaddress +- we get pagefault and reach fault handler +- fault handler finds and returns physical page +- we get page_mkwrite where we add this page to a list +- schedule a workqueue task to be run after a delay +- app continues writing to that page with no additional cost. this is + the key benefit. +- the workqueue task comes in and mkcleans the pages on the list, then + completes the work associated with updating the framebuffer. this is + the real work talking to the device. +- app tries to write to the address (that has now been mkcleaned) +- get pagefault and the above sequence occurs again + +As can be seen from above, one benefit is roughly to allow bursty framebuffer +writes to occur at minimum cost. Then after some time when hopefully things +have gone quiet, we go and really update the framebuffer which would be +a relatively more expensive operation. + +For some types of nonvolatile high latency displays, the desired image is +the final image rather than the intermediate stages which is why it's okay +to not update for each write that is occurring. + +It may be the case that this is useful in other scenarios as well. Paul Mundt +has mentioned a case where it is beneficial to use the page count to decide +whether to coalesce and issue SG DMA or to do memory bursts. + +Another one may be if one has a device framebuffer that is in an usual format, +say diagonally shifting RGB, this may then be a mechanism for you to allow +apps to pretend to have a normal framebuffer but reswizzle for the device +framebuffer at vsync time based on the touched pagelist. + +How to use it: (for applications) +--------------------------------- +No changes needed. mmap the framebuffer like normal and just use it. 
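For illustration only, an application-side mapping might look like the sketch below. It uses the standard fbdev ioctls `FBIOGET_VSCREENINFO` and `FBIOGET_FSCREENINFO` plus `mmap`; the helper names `fb_pixel_offset` and `fb_clear_first_line`, and the device path passed in by the caller, are assumptions made for this example. Nothing in it is specific to deferred IO, which is the point: the driver's fault and page_mkwrite handlers do their work behind an ordinary mapping.

```c
#include <fcntl.h>
#include <linux/fb.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Byte offset of pixel (x, y) in a packed-pixel framebuffer. */
size_t fb_pixel_offset(uint32_t x, uint32_t y,
                       uint32_t line_length, uint32_t bits_per_pixel)
{
    return (size_t)y * line_length + (size_t)x * (bits_per_pixel / 8);
}

/*
 * Hypothetical helper: map the framebuffer device at `path` and zero
 * its first scanline. Returns 0 on success, -1 on any failure.
 */
int fb_clear_first_line(const char *path)
{
    struct fb_var_screeninfo var;
    struct fb_fix_screeninfo fix;
    uint8_t *fb;
    size_t n;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;

    if (ioctl(fd, FBIOGET_VSCREENINFO, &var) < 0 ||
        ioctl(fd, FBIOGET_FSCREENINFO, &fix) < 0) {
        close(fd);
        return -1;
    }

    fb = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* A plain store; with deferred IO this is what triggers the fault path. */
    n = fix.line_length <= fix.smem_len ? fix.line_length : fix.smem_len;
    memset(fb + fb_pixel_offset(0, 0, fix.line_length, var.bits_per_pixel), 0, n);

    munmap(fb, fix.smem_len);
    close(fd);
    return 0;
}
```

A caller would typically pass something like `"/dev/fb0"`; note that this writes to the real display, so it is best tried on a test machine.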
+ +How to use it: (for fbdev drivers) +---------------------------------- +The following example may be helpful. + +1. Setup your structure. Eg:: + + static struct fb_deferred_io hecubafb_defio = { + .delay = HZ, + .deferred_io = hecubafb_dpy_deferred_io, + }; + +The delay is the minimum delay between when the page_mkwrite trigger occurs +and when the deferred_io callback is called. The deferred_io callback is +explained below. + +2. Setup your deferred IO callback. Eg:: + + static void hecubafb_dpy_deferred_io(struct fb_info *info, + struct list_head *pagelist) + +The deferred_io callback is where you would perform all your IO to the display +device. You receive the pagelist which is the list of pages that were written +to during the delay. You must not modify this list. This callback is called +from a workqueue. + +3. Call init:: + + info->fbdefio = &hecubafb_defio; + fb_deferred_io_init(info); + +4. Call cleanup:: + + fb_deferred_io_cleanup(info); diff --git a/Documentation/fb/deferred_io.txt b/Documentation/fb/deferred_io.txt deleted file mode 100644 index 748328370250..000000000000 --- a/Documentation/fb/deferred_io.txt +++ /dev/null @@ -1,75 +0,0 @@ -Deferred IO ------------ - -Deferred IO is a way to delay and repurpose IO. It uses host memory as a -buffer and the MMU pagefault as a pretrigger for when to perform the device -IO. The following example may be a useful explanation of how one such setup -works: - -- userspace app like Xfbdev mmaps framebuffer -- deferred IO and driver sets up fault and page_mkwrite handlers -- userspace app tries to write to mmaped vaddress -- we get pagefault and reach fault handler -- fault handler finds and returns physical page -- we get page_mkwrite where we add this page to a list -- schedule a workqueue task to be run after a delay -- app continues writing to that page with no additional cost. this is - the key benefit. 
-- the workqueue task comes in and mkcleans the pages on the list, then - completes the work associated with updating the framebuffer. this is - the real work talking to the device. -- app tries to write to the address (that has now been mkcleaned) -- get pagefault and the above sequence occurs again - -As can be seen from above, one benefit is roughly to allow bursty framebuffer -writes to occur at minimum cost. Then after some time when hopefully things -have gone quiet, we go and really update the framebuffer which would be -a relatively more expensive operation. - -For some types of nonvolatile high latency displays, the desired image is -the final image rather than the intermediate stages which is why it's okay -to not update for each write that is occurring. - -It may be the case that this is useful in other scenarios as well. Paul Mundt -has mentioned a case where it is beneficial to use the page count to decide -whether to coalesce and issue SG DMA or to do memory bursts. - -Another one may be if one has a device framebuffer that is in an usual format, -say diagonally shifting RGB, this may then be a mechanism for you to allow -apps to pretend to have a normal framebuffer but reswizzle for the device -framebuffer at vsync time based on the touched pagelist. - -How to use it: (for applications) ---------------------------------- -No changes needed. mmap the framebuffer like normal and just use it. - -How to use it: (for fbdev drivers) ----------------------------------- -The following example may be helpful. - -1. Setup your structure. Eg: - -static struct fb_deferred_io hecubafb_defio = { - .delay = HZ, - .deferred_io = hecubafb_dpy_deferred_io, -}; - -The delay is the minimum delay between when the page_mkwrite trigger occurs -and when the deferred_io callback is called. The deferred_io callback is -explained below. - -2. Setup your deferred IO callback. 
Eg: -static void hecubafb_dpy_deferred_io(struct fb_info *info, - struct list_head *pagelist) - -The deferred_io callback is where you would perform all your IO to the display -device. You receive the pagelist which is the list of pages that were written -to during the delay. You must not modify this list. This callback is called -from a workqueue. - -3. Call init - info->fbdefio = &hecubafb_defio; - fb_deferred_io_init(info); - -4. Call cleanup - fb_deferred_io_cleanup(info); diff --git a/Documentation/fb/efifb.rst b/Documentation/fb/efifb.rst new file mode 100644 index 000000000000..04840331a00e --- /dev/null +++ b/Documentation/fb/efifb.rst @@ -0,0 +1,39 @@ +============== +What is efifb? +============== + +This is a generic EFI platform driver for Intel based Apple computers. +efifb is only for EFI booted Intel Macs. + +Supported Hardware +================== + +- iMac 17"/20" +- Macbook +- Macbook Pro 15"/17" +- MacMini + +How to use it? +============== + +efifb does not have any kind of autodetection of your machine. +You have to add the following kernel parameters in your elilo.conf:: + + Macbook : + video=efifb:macbook + MacMini : + video=efifb:mini + Macbook Pro 15", iMac 17" : + video=efifb:i17 + Macbook Pro 17", iMac 20" : + video=efifb:i20 + +Accepted options: + +======= =========================================================== +nowc Don't map the framebuffer write combined. This can be used + to workaround side-effects and slowdowns on other CPU cores + when large amounts of console data are written. +======= =========================================================== + +Edgar Hucek diff --git a/Documentation/fb/efifb.txt b/Documentation/fb/efifb.txt deleted file mode 100644 index 1a85c1bdaf38..000000000000 --- a/Documentation/fb/efifb.txt +++ /dev/null @@ -1,37 +0,0 @@ - -What is efifb? -=============== - -This is a generic EFI platform driver for Intel based Apple computers. -efifb is only for EFI booted Intel Macs. 
- -Supported Hardware -================== - -iMac 17"/20" -Macbook -Macbook Pro 15"/17" -MacMini - -How to use it? -============== - -efifb does not have any kind of autodetection of your machine. -You have to add the following kernel parameters in your elilo.conf: - Macbook : - video=efifb:macbook - MacMini : - video=efifb:mini - Macbook Pro 15", iMac 17" : - video=efifb:i17 - Macbook Pro 17", iMac 20" : - video=efifb:i20 - -Accepted options: - -nowc Don't map the framebuffer write combined. This can be used - to workaround side-effects and slowdowns on other CPU cores - when large amounts of console data are written. - --- -Edgar Hucek diff --git a/Documentation/fb/ep93xx-fb.rst b/Documentation/fb/ep93xx-fb.rst new file mode 100644 index 000000000000..6f7767926d1a --- /dev/null +++ b/Documentation/fb/ep93xx-fb.rst @@ -0,0 +1,140 @@ +================================ +Driver for EP93xx LCD controller +================================ + +The EP93xx LCD controller can drive both standard desktop monitors and +embedded LCD displays. If you have a standard desktop monitor then you +can use the standard Linux video mode database. In your board file:: + + static struct ep93xxfb_mach_info some_board_fb_info = { + .num_modes = EP93XXFB_USE_MODEDB, + .bpp = 16, + }; + +If you have an embedded LCD display then you need to define a video +mode for it as follows:: + + static struct fb_videomode some_board_video_modes[] = { + { + .name = "some_lcd_name", + /* Pixel clock, porches, etc */ + }, + }; + +Note that the pixel clock value is in pico-seconds. You can use the +KHZ2PICOS macro to convert the pixel clock value. Most other values +are in pixel clocks. See Documentation/fb/framebuffer.rst for further +details. 
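As a worked example of the pixel clock conversion: a clock of f kHz has a period of 10^9 / f picoseconds, which is the arithmetic `KHZ2PICOS` is understood to perform. The sketch below repeats that division in a hypothetical standalone helper, `khz_to_picos`, so it can be checked outside the kernel; the kernel macro itself is not used here.

```c
/*
 * Hypothetical userspace stand-in for the KHZ2PICOS conversion:
 * period in picoseconds = 1e12 / (khz * 1e3) = 1e9 / khz.
 * The result is truncated towards zero, as with integer division.
 */
unsigned long khz_to_picos(unsigned long khz)
{
    return 1000000000UL / khz;
}
```

For instance, a 25.175 MHz VGA dot clock (25175 kHz) gives 39721 ps, and a 40 MHz clock gives exactly 25000 ps. In a board file the corresponding `fb_videomode` entry would then carry that value in its pixel clock field.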
+ +The ep93xxfb_mach_info structure for your board should look like the +following:: + + static struct ep93xxfb_mach_info some_board_fb_info = { + .num_modes = ARRAY_SIZE(some_board_video_modes), + .modes = some_board_video_modes, + .default_mode = &some_board_video_modes[0], + .bpp = 16, + }; + +The framebuffer device can be registered by adding the following to +your board initialisation function:: + + ep93xx_register_fb(&some_board_fb_info); + +===================== +Video Attribute Flags +===================== + +The ep93xxfb_mach_info structure has a flags field which can be used +to configure the controller. The video attributes flags are fully +documented in section 7 of the EP93xx users' guide. The following +flags are available: + +=============================== ========================================== +EP93XXFB_PCLK_FALLING Clock data on the falling edge of the + pixel clock. The default is to clock + data on the rising edge. + +EP93XXFB_SYNC_BLANK_HIGH Blank signal is active high. By + default the blank signal is active low. + +EP93XXFB_SYNC_HORIZ_HIGH Horizontal sync is active high. By + default the horizontal sync is active low. + +EP93XXFB_SYNC_VERT_HIGH Vertical sync is active high. By + default the vertical sync is active high. +=============================== ========================================== + +The physical address of the framebuffer can be controlled using the +following flags: + +=============================== ====================================== +EP93XXFB_USE_SDCSN0 Use SDCSn[0] for the framebuffer. This + is the default setting. + +EP93XXFB_USE_SDCSN1 Use SDCSn[1] for the framebuffer. + +EP93XXFB_USE_SDCSN2 Use SDCSn[2] for the framebuffer. + +EP93XXFB_USE_SDCSN3 Use SDCSn[3] for the framebuffer. 
+=============================== ====================================== + +================== +Platform callbacks +================== + +The EP93xx framebuffer driver supports three optional platform +callbacks: setup, teardown and blank. The setup and teardown functions +are called when the framebuffer driver is installed and removed +respectively. The blank function is called whenever the display is +blanked or unblanked. + +The setup and teardown devices pass the platform_device structure as +an argument. The fb_info and ep93xxfb_mach_info structures can be +obtained as follows:: + + static int some_board_fb_setup(struct platform_device *pdev) + { + struct ep93xxfb_mach_info *mach_info = pdev->dev.platform_data; + struct fb_info *fb_info = platform_get_drvdata(pdev); + + /* Board specific framebuffer setup */ + } + +====================== +Setting the video mode +====================== + +The video mode is set using the following syntax:: + + video=XRESxYRES[-BPP][@REFRESH] + +If the EP93xx video driver is built-in then the video mode is set on +the Linux kernel command line, for example:: + + video=ep93xx-fb:800x600-16@60 + +If the EP93xx video driver is built as a module then the video mode is +set when the module is installed:: + + modprobe ep93xx-fb video=320x240 + +============== +Screenpage bug +============== + +At least on the EP9315 there is a silicon bug which causes bit 27 of +the VIDSCRNPAGE (framebuffer physical offset) to be tied low. There is +an unofficial errata for this bug at:: + + http://marc.info/?l=linux-arm-kernel&m=110061245502000&w=2 + +By default the EP93xx framebuffer driver checks if the allocated physical +address has bit 27 set. If it does, then the memory is freed and an +error is returned. The check can be disabled by adding the following +option when loading the driver:: + + ep93xx-fb.check_screenpage_bug=0 + +In some cases it may be possible to reconfigure your SDRAM layout to +avoid this bug. 
See section 13 of the EP93xx users' guide for details. diff --git a/Documentation/fb/ep93xx-fb.txt b/Documentation/fb/ep93xx-fb.txt deleted file mode 100644 index 5af1bd9effae..000000000000 --- a/Documentation/fb/ep93xx-fb.txt +++ /dev/null @@ -1,135 +0,0 @@ -================================ -Driver for EP93xx LCD controller -================================ - -The EP93xx LCD controller can drive both standard desktop monitors and -embedded LCD displays. If you have a standard desktop monitor then you -can use the standard Linux video mode database. In your board file: - - static struct ep93xxfb_mach_info some_board_fb_info = { - .num_modes = EP93XXFB_USE_MODEDB, - .bpp = 16, - }; - -If you have an embedded LCD display then you need to define a video -mode for it as follows: - - static struct fb_videomode some_board_video_modes[] = { - { - .name = "some_lcd_name", - /* Pixel clock, porches, etc */ - }, - }; - -Note that the pixel clock value is in pico-seconds. You can use the -KHZ2PICOS macro to convert the pixel clock value. Most other values -are in pixel clocks. See Documentation/fb/framebuffer.txt for further -details. - -The ep93xxfb_mach_info structure for your board should look like the -following: - - static struct ep93xxfb_mach_info some_board_fb_info = { - .num_modes = ARRAY_SIZE(some_board_video_modes), - .modes = some_board_video_modes, - .default_mode = &some_board_video_modes[0], - .bpp = 16, - }; - -The framebuffer device can be registered by adding the following to -your board initialisation function: - - ep93xx_register_fb(&some_board_fb_info); - -===================== -Video Attribute Flags -===================== - -The ep93xxfb_mach_info structure has a flags field which can be used -to configure the controller. The video attributes flags are fully -documented in section 7 of the EP93xx users' guide. The following -flags are available: - -EP93XXFB_PCLK_FALLING Clock data on the falling edge of the - pixel clock. 
The default is to clock - data on the rising edge. - -EP93XXFB_SYNC_BLANK_HIGH Blank signal is active high. By - default the blank signal is active low. - -EP93XXFB_SYNC_HORIZ_HIGH Horizontal sync is active high. By - default the horizontal sync is active low. - -EP93XXFB_SYNC_VERT_HIGH Vertical sync is active high. By - default the vertical sync is active high. - -The physical address of the framebuffer can be controlled using the -following flags: - -EP93XXFB_USE_SDCSN0 Use SDCSn[0] for the framebuffer. This - is the default setting. - -EP93XXFB_USE_SDCSN1 Use SDCSn[1] for the framebuffer. - -EP93XXFB_USE_SDCSN2 Use SDCSn[2] for the framebuffer. - -EP93XXFB_USE_SDCSN3 Use SDCSn[3] for the framebuffer. - -================== -Platform callbacks -================== - -The EP93xx framebuffer driver supports three optional platform -callbacks: setup, teardown and blank. The setup and teardown functions -are called when the framebuffer driver is installed and removed -respectively. The blank function is called whenever the display is -blanked or unblanked. - -The setup and teardown devices pass the platform_device structure as -an argument. 
The fb_info and ep93xxfb_mach_info structures can be -obtained as follows: - - static int some_board_fb_setup(struct platform_device *pdev) - { - struct ep93xxfb_mach_info *mach_info = pdev->dev.platform_data; - struct fb_info *fb_info = platform_get_drvdata(pdev); - - /* Board specific framebuffer setup */ - } - -====================== -Setting the video mode -====================== - -The video mode is set using the following syntax: - - video=XRESxYRES[-BPP][@REFRESH] - -If the EP93xx video driver is built-in then the video mode is set on -the Linux kernel command line, for example: - - video=ep93xx-fb:800x600-16@60 - -If the EP93xx video driver is built as a module then the video mode is -set when the module is installed: - - modprobe ep93xx-fb video=320x240 - -============== -Screenpage bug -============== - -At least on the EP9315 there is a silicon bug which causes bit 27 of -the VIDSCRNPAGE (framebuffer physical offset) to be tied low. There is -an unofficial errata for this bug at: - http://marc.info/?l=linux-arm-kernel&m=110061245502000&w=2 - -By default the EP93xx framebuffer driver checks if the allocated physical -address has bit 27 set. If it does, then the memory is freed and an -error is returned. The check can be disabled by adding the following -option when loading the driver: - - ep93xx-fb.check_screenpage_bug=0 - -In some cases it may be possible to reconfigure your SDRAM layout to -avoid this bug. See section 13 of the EP93xx users' guide for details. diff --git a/Documentation/fb/fbcon.rst b/Documentation/fb/fbcon.rst new file mode 100644 index 000000000000..cfb9f7c38f18 --- /dev/null +++ b/Documentation/fb/fbcon.rst @@ -0,0 +1,350 @@ +======================= +The Framebuffer Console +======================= + +The framebuffer console (fbcon), as its name implies, is a text +console running on top of the framebuffer device. 
It has the functionality of +any standard text console driver, such as the VGA console, with the added +features that can be attributed to the graphical nature of the framebuffer. + +In the x86 architecture, the framebuffer console is optional, and +some even treat it as a toy. For other architectures, it is the only available +display device, text or graphical. + +What are the features of fbcon? The framebuffer console supports +high resolutions, varying font types, display rotation, primitive multihead, +etc. Theoretically, multi-colored fonts, blending, aliasing, and any feature +made available by the underlying graphics card are also possible. + +A. Configuration +================ + +The framebuffer console can be enabled by using your favorite kernel +configuration tool. It is under Device Drivers->Graphics Support->Frame +buffer Devices->Console display driver support->Framebuffer Console Support. +Select 'y' to compile support statically or 'm' for module support. The +module will be fbcon. + +In order for fbcon to activate, at least one framebuffer driver is +required, so choose from any of the numerous drivers available. For x86 +systems, they almost universally have VGA cards, so vga16fb and vesafb will +always be available. However, using a chipset-specific driver will give you +more speed and features, such as the ability to change the video mode +dynamically. + +To display the penguin logo, choose any logo available in Graphics +support->Bootup logo. + +Also, you will need to select at least one compiled-in font, but if +you don't do anything, the kernel configuration tool will select one for you, +usually an 8x16 font. + +GOTCHA: A common bug report is enabling the framebuffer without enabling the +framebuffer console. Depending on the driver, you may get a blanked or +garbled display, but the system still boots to completion. If you are +fortunate to have a driver that does not alter the graphics chip, then you +will still get a VGA console. + +B. 
Loading +========== + +Possible scenarios: + +1. Driver and fbcon are compiled statically + + Usually, fbcon will automatically take over your console. The notable + exception is vesafb. It needs to be explicitly activated with the + vga= boot option parameter. + +2. Driver is compiled statically, fbcon is compiled as a module + + Depending on the driver, you either get a standard console, or a + garbled display, as mentioned above. To get a framebuffer console, + do a 'modprobe fbcon'. + +3. Driver is compiled as a module, fbcon is compiled statically + + You get your standard console. Once the driver is loaded with + 'modprobe xxxfb', fbcon automatically takes over the console with + the possible exception of using the fbcon=map:n option. See below. + +4. Driver and fbcon are compiled as a module. + + You can load them in any order. Once both are loaded, fbcon will take + over the console. + +C. Boot options + + The framebuffer console has several, largely unknown, boot options + that can change its behavior. + +1. fbcon=font:<name> + + Select the initial font to use. The value 'name' can be any of the + compiled-in fonts: 10x18, 6x10, 7x14, Acorn8x8, MINI4x6, + PEARL8x8, ProFont6x11, SUN12x22, SUN8x16, VGA8x16, VGA8x8. + + Note, not all drivers can handle font with widths not divisible by 8, + such as vga16fb. + +2. fbcon=scrollback:<value>[k] + + The scrollback buffer is memory that is used to preserve display + contents that has already scrolled past your view. This is accessed + by using the Shift-PageUp key combination. The value 'value' is any + integer. It defaults to 32KB. The 'k' suffix is optional, and will + multiply the 'value' by 1024. + +3. fbcon=map:<0123> + + This is an interesting option. It tells which driver gets mapped to + which console. The value '0123' is a sequence that gets repeated until + the total length is 64 which is the number of consoles available. In + the above example, it is expanded to 012301230123...
and the mapping + will be:: + + tty | 1 2 3 4 5 6 7 8 9 ... + fb | 0 1 2 3 0 1 2 3 0 ... + + ('cat /proc/fb' should tell you what the fb numbers are) + + One side effect that may be useful is using a map value that exceeds + the number of loaded fb drivers. For example, if only one driver is + available, fb0, adding fbcon=map:1 tells fbcon not to take over the + console. + + Later on, when you want to map the console the to the framebuffer + device, you can use the con2fbmap utility. + +4. fbcon=vc:<n1>-<n2> + + This option tells fbcon to take over only a range of consoles as + specified by the values 'n1' and 'n2'. The rest of the consoles + outside the given range will still be controlled by the standard + console driver. + + NOTE: For x86 machines, the standard console is the VGA console which + is typically located on the same video card. Thus, the consoles that + are controlled by the VGA console will be garbled. + +4. fbcon=rotate:<n> + + This option changes the orientation angle of the console display. The + value 'n' accepts the following: + + - 0 - normal orientation (0 degree) + - 1 - clockwise orientation (90 degrees) + - 2 - upside down orientation (180 degrees) + - 3 - counterclockwise orientation (270 degrees) + + The angle can be changed anytime afterwards by 'echoing' the same + numbers to any one of the 2 attributes found in + /sys/class/graphics/fbcon: + + - rotate - rotate the display of the active console + - rotate_all - rotate the display of all consoles + + Console rotation will only become available if Framebuffer Console + Rotation support is compiled in your kernel. + + NOTE: This is purely console rotation. Any other applications that + use the framebuffer will remain at their 'normal' orientation. + Actually, the underlying fb driver is totally ignorant of console + rotation. + +5. fbcon=margin:<color> + + This option specifies the color of the margins.
The margins are the + leftover area at the right and the bottom of the screen that are not + used by text. By default, this area will be black. The 'color' value + is an integer number that depends on the framebuffer driver being used. + +6. fbcon=nodefer + + If the kernel is compiled with deferred fbcon takeover support, normally + the framebuffer contents, left in place by the firmware/bootloader, will + be preserved until there actually is some text output to the console. + This option causes fbcon to bind immediately to the fbdev device. + +7. fbcon=logo-pos: + + The only possible 'location' is 'center' (without quotes), and when + given, the bootup logo is moved from the default top-left corner + location to the center of the framebuffer. If more than one logo is + displayed due to multiple CPUs, the collected line of logos is moved + as a whole. + +C. Attaching, Detaching and Unloading + +Before going on to how to attach, detach and unload the framebuffer console, an +illustration of the dependencies may help. + +The console layer, as with most subsystems, needs a driver that interfaces with +the hardware. Thus, in a VGA console:: + + console ---> VGA driver ---> hardware. + +Assuming the VGA driver can be unloaded, one must first unbind the VGA driver +from the console layer before unloading the driver. The VGA driver cannot be +unloaded if it is still bound to the console layer. (See +Documentation/console/console.txt for more information). + +This is more complicated in the case of the framebuffer console (fbcon), +because fbcon is an intermediate layer between the console and the drivers:: + + console ---> fbcon ---> fbdev drivers ---> hardware + +The fbdev drivers cannot be unloaded if bound to fbcon, and fbcon cannot +be unloaded if it's bound to the console layer. + +So to unload the fbdev drivers, one must first unbind fbcon from the console, +then unbind the fbdev drivers from fbcon.
Fortunately, unbinding fbcon from +the console layer will automatically unbind framebuffer drivers from +fbcon. Thus, there is no need to explicitly unbind the fbdev drivers from +fbcon. + +So, how do we unbind fbcon from the console? Part of the answer is in +Documentation/console/console.txt. To summarize: + +Echo a value to the bind file that represents the framebuffer console +driver. So assuming vtcon1 represents fbcon, then:: + + echo 1 > /sys/class/vtconsole/vtcon1/bind - attach framebuffer console to + console layer + echo 0 > /sys/class/vtconsole/vtcon1/bind - detach framebuffer console from + console layer + +If fbcon is detached from the console layer, your boot console driver (which is +usually VGA text mode) will take over. A few drivers (rivafb and i810fb) will +restore VGA text mode for you. With the rest, before detaching fbcon, you +must take a few additional steps to make sure that your VGA text mode is +restored properly. The following is one of the several methods that you can do: + +1. Download or install vbetool. This utility is included with most + distributions nowadays, and is usually part of the suspend/resume tool. + +2. In your kernel configuration, ensure that CONFIG_FRAMEBUFFER_CONSOLE is set + to 'y' or 'm'. Enable one or more of your favorite framebuffer drivers. + +3. Boot into text mode and as root run:: + + vbetool vbestate save > + + The above command saves the register contents of your graphics + hardware to . You need to do this step only once as + the state file can be reused. + +4. If fbcon is compiled as a module, load fbcon by doing:: + + modprobe fbcon + +5. Now to detach fbcon:: + + vbetool vbestate restore < && \ + echo 0 > /sys/class/vtconsole/vtcon1/bind + +6. That's it, you're back to VGA mode. And if you compiled fbcon as a module, + you can unload it by 'rmmod fbcon'. + +7. To reattach fbcon:: + + echo 1 > /sys/class/vtconsole/vtcon1/bind + +8.
Once fbcon is unbound, all drivers registered to the system will also +become unbound. This means that fbcon and individual framebuffer drivers +can be unloaded or reloaded at will. Reloading the drivers or fbcon will +automatically bind the console, fbcon and the drivers together. Unloading +all the drivers without unloading fbcon will make it impossible for the +console to bind fbcon. + +Notes for vesafb users: +======================= + +Unfortunately, if your bootline includes a vga=xxx parameter that sets the +hardware in graphics mode, such as when loading vesafb, vgacon will not load. +Instead, vgacon will replace the default boot console with dummycon, and you +won't get any display after detaching fbcon. Your machine is still alive, so +you can reattach vesafb. However, to reattach vesafb, you need to do one of +the following: + +Variation 1: + + a. Before detaching fbcon, do:: + + vbetool vbemode save > # do once for each vesafb mode, + # the file can be reused + + b. Detach fbcon as in step 5. + + c. Attach fbcon:: + + vbetool vbestate restore < && \ + echo 1 > /sys/class/vtconsole/vtcon1/bind + +Variation 2: + + a. Before detaching fbcon, do:: + + echo > /sys/class/tty/console/bind + + vbetool vbemode get + + b. Take note of the mode number + + b. Detach fbcon as in step 5. + + c. 
Attach fbcon:: + + vbetool vbemode set && \ + echo 1 > /sys/class/vtconsole/vtcon1/bind + +Samples: +======== + +Here are 2 sample bash scripts that you can use to bind or unbind the +framebuffer console driver if you are on an X86 box:: + + #!/bin/bash + # Unbind fbcon + + # Change this to where your actual vgastate file is located + # Or Use VGASTATE=$1 to indicate the state file at runtime + VGASTATE=/tmp/vgastate + + # path to vbetool + VBETOOL=/usr/local/bin + + + for (( i = 0; i < 16; i++)) + do + if test -x /sys/class/vtconsole/vtcon$i; then + if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ + = 1 ]; then + if test -x $VBETOOL/vbetool; then + echo Unbinding vtcon$i + $VBETOOL/vbetool vbestate restore < $VGASTATE + echo 0 > /sys/class/vtconsole/vtcon$i/bind + fi + fi + fi + done + +--------------------------------------------------------------------------- + +:: + + #!/bin/bash + # Bind fbcon + + for (( i = 0; i < 16; i++)) + do + if test -x /sys/class/vtconsole/vtcon$i; then + if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ + = 1 ]; then + echo Binding vtcon$i + echo 1 > /sys/class/vtconsole/vtcon$i/bind + fi + fi + done + +Antonino Daplas diff --git a/Documentation/fb/fbcon.txt b/Documentation/fb/fbcon.txt deleted file mode 100644 index 60a5ec04e8f0..000000000000 --- a/Documentation/fb/fbcon.txt +++ /dev/null @@ -1,347 +0,0 @@ -The Framebuffer Console -======================= - - The framebuffer console (fbcon), as its name implies, is a text -console running on top of the framebuffer device. It has the functionality of -any standard text console driver, such as the VGA console, with the added -features that can be attributed to the graphical nature of the framebuffer. - - In the x86 architecture, the framebuffer console is optional, and -some even treat it as a toy. For other architectures, it is the only available -display device, text or graphical. - - What are the features of fbcon?
The framebuffer console supports -high resolutions, varying font types, display rotation, primitive multihead, -etc. Theoretically, multi-colored fonts, blending, aliasing, and any feature -made available by the underlying graphics card are also possible. - -A. Configuration - - The framebuffer console can be enabled by using your favorite kernel -configuration tool. It is under Device Drivers->Graphics Support->Frame -buffer Devices->Console display driver support->Framebuffer Console Support. -Select 'y' to compile support statically or 'm' for module support. The -module will be fbcon. - - In order for fbcon to activate, at least one framebuffer driver is -required, so choose from any of the numerous drivers available. For x86 -systems, they almost universally have VGA cards, so vga16fb and vesafb will -always be available. However, using a chipset-specific driver will give you -more speed and features, such as the ability to change the video mode -dynamically. - - To display the penguin logo, choose any logo available in Graphics -support->Bootup logo. - - Also, you will need to select at least one compiled-in font, but if -you don't do anything, the kernel configuration tool will select one for you, -usually an 8x16 font. - -GOTCHA: A common bug report is enabling the framebuffer without enabling the -framebuffer console. Depending on the driver, you may get a blanked or -garbled display, but the system still boots to completion. If you are -fortunate to have a driver that does not alter the graphics chip, then you -will still get a VGA console. - -B. Loading - -Possible scenarios: - -1. Driver and fbcon are compiled statically - - Usually, fbcon will automatically take over your console. The notable - exception is vesafb. It needs to be explicitly activated with the - vga= boot option parameter. - -2. 
Driver is compiled statically, fbcon is compiled as a module - - Depending on the driver, you either get a standard console, or a - garbled display, as mentioned above. To get a framebuffer console, - do a 'modprobe fbcon'. - -3. Driver is compiled as a module, fbcon is compiled statically - - You get your standard console. Once the driver is loaded with - 'modprobe xxxfb', fbcon automatically takes over the console with - the possible exception of using the fbcon=map:n option. See below. - -4. Driver and fbcon are compiled as a module. - - You can load them in any order. Once both are loaded, fbcon will take - over the console. - -C. Boot options - - The framebuffer console has several, largely unknown, boot options - that can change its behavior. - -1. fbcon=font: - - Select the initial font to use. The value 'name' can be any of the - compiled-in fonts: 10x18, 6x10, 7x14, Acorn8x8, MINI4x6, - PEARL8x8, ProFont6x11, SUN12x22, SUN8x16, VGA8x16, VGA8x8. - - Note, not all drivers can handle font with widths not divisible by 8, - such as vga16fb. - -2. fbcon=scrollback:[k] - - The scrollback buffer is memory that is used to preserve display - contents that has already scrolled past your view. This is accessed - by using the Shift-PageUp key combination. The value 'value' is any - integer. It defaults to 32KB. The 'k' suffix is optional, and will - multiply the 'value' by 1024. - -3. fbcon=map:<0123> - - This is an interesting option. It tells which driver gets mapped to - which console. The value '0123' is a sequence that gets repeated until - the total length is 64 which is the number of consoles available. In - the above example, it is expanded to 012301230123... and the mapping - will be: - - tty | 1 2 3 4 5 6 7 8 9 ... - fb | 0 1 2 3 0 1 2 3 0 ... - - ('cat /proc/fb' should tell you what the fb numbers are) - - One side effect that may be useful is using a map value that exceeds - the number of loaded fb drivers. 
For example, if only one driver is - available, fb0, adding fbcon=map:1 tells fbcon not to take over the - console. - - Later on, when you want to map the console the to the framebuffer - device, you can use the con2fbmap utility. - -4. fbcon=vc:- - - This option tells fbcon to take over only a range of consoles as - specified by the values 'n1' and 'n2'. The rest of the consoles - outside the given range will still be controlled by the standard - console driver. - - NOTE: For x86 machines, the standard console is the VGA console which - is typically located on the same video card. Thus, the consoles that - are controlled by the VGA console will be garbled. - -4. fbcon=rotate: - - This option changes the orientation angle of the console display. The - value 'n' accepts the following: - - 0 - normal orientation (0 degree) - 1 - clockwise orientation (90 degrees) - 2 - upside down orientation (180 degrees) - 3 - counterclockwise orientation (270 degrees) - - The angle can be changed anytime afterwards by 'echoing' the same - numbers to any one of the 2 attributes found in - /sys/class/graphics/fbcon: - - rotate - rotate the display of the active console - rotate_all - rotate the display of all consoles - - Console rotation will only become available if Framebuffer Console - Rotation support is compiled in your kernel. - - NOTE: This is purely console rotation. Any other applications that - use the framebuffer will remain at their 'normal' orientation. - Actually, the underlying fb driver is totally ignorant of console - rotation. - -5. fbcon=margin: - - This option specifies the color of the margins. The margins are the - leftover area at the right and the bottom of the screen that are not - used by text. By default, this area will be black. The 'color' value - is an integer number that depends on the framebuffer driver being used. - -6. 
fbcon=nodefer - - If the kernel is compiled with deferred fbcon takeover support, normally - the framebuffer contents, left in place by the firmware/bootloader, will - be preserved until there actually is some text is output to the console. - This option causes fbcon to bind immediately to the fbdev device. - -7. fbcon=logo-pos: - - The only possible 'location' is 'center' (without quotes), and when - given, the bootup logo is moved from the default top-left corner - location to the center of the framebuffer. If more than one logo is - displayed due to multiple CPUs, the collected line of logos is moved - as a whole. - -C. Attaching, Detaching and Unloading - -Before going on to how to attach, detach and unload the framebuffer console, an -illustration of the dependencies may help. - -The console layer, as with most subsystems, needs a driver that interfaces with -the hardware. Thus, in a VGA console: - -console ---> VGA driver ---> hardware. - -Assuming the VGA driver can be unloaded, one must first unbind the VGA driver -from the console layer before unloading the driver. The VGA driver cannot be -unloaded if it is still bound to the console layer. (See -Documentation/console/console.txt for more information). - -This is more complicated in the case of the framebuffer console (fbcon), -because fbcon is an intermediate layer between the console and the drivers: - -console ---> fbcon ---> fbdev drivers ---> hardware - -The fbdev drivers cannot be unloaded if bound to fbcon, and fbcon cannot -be unloaded if it's bound to the console layer. - -So to unload the fbdev drivers, one must first unbind fbcon from the console, -then unbind the fbdev drivers from fbcon. Fortunately, unbinding fbcon from -the console layer will automatically unbind framebuffer drivers from -fbcon. Thus, there is no need to explicitly unbind the fbdev drivers from -fbcon. - -So, how do we unbind fbcon from the console? Part of the answer is in -Documentation/console/console.txt. 
To summarize: - -Echo a value to the bind file that represents the framebuffer console -driver. So assuming vtcon1 represents fbcon, then: - -echo 1 > sys/class/vtconsole/vtcon1/bind - attach framebuffer console to - console layer -echo 0 > sys/class/vtconsole/vtcon1/bind - detach framebuffer console from - console layer - -If fbcon is detached from the console layer, your boot console driver (which is -usually VGA text mode) will take over. A few drivers (rivafb and i810fb) will -restore VGA text mode for you. With the rest, before detaching fbcon, you -must take a few additional steps to make sure that your VGA text mode is -restored properly. The following is one of the several methods that you can do: - -1. Download or install vbetool. This utility is included with most - distributions nowadays, and is usually part of the suspend/resume tool. - -2. In your kernel configuration, ensure that CONFIG_FRAMEBUFFER_CONSOLE is set - to 'y' or 'm'. Enable one or more of your favorite framebuffer drivers. - -3. Boot into text mode and as root run: - - vbetool vbestate save > - - The above command saves the register contents of your graphics - hardware to . You need to do this step only once as - the state file can be reused. - -4. If fbcon is compiled as a module, load fbcon by doing: - - modprobe fbcon - -5. Now to detach fbcon: - - vbetool vbestate restore < && \ - echo 0 > /sys/class/vtconsole/vtcon1/bind - -6. That's it, you're back to VGA mode. And if you compiled fbcon as a module, - you can unload it by 'rmmod fbcon'. - -7. To reattach fbcon: - - echo 1 > /sys/class/vtconsole/vtcon1/bind - -8. Once fbcon is unbound, all drivers registered to the system will also -become unbound. This means that fbcon and individual framebuffer drivers -can be unloaded or reloaded at will. Reloading the drivers or fbcon will -automatically bind the console, fbcon and the drivers together. 
Unloading -all the drivers without unloading fbcon will make it impossible for the -console to bind fbcon. - -Notes for vesafb users: -======================= - -Unfortunately, if your bootline includes a vga=xxx parameter that sets the -hardware in graphics mode, such as when loading vesafb, vgacon will not load. -Instead, vgacon will replace the default boot console with dummycon, and you -won't get any display after detaching fbcon. Your machine is still alive, so -you can reattach vesafb. However, to reattach vesafb, you need to do one of -the following: - -Variation 1: - - a. Before detaching fbcon, do - - vbetool vbemode save > # do once for each vesafb mode, - # the file can be reused - - b. Detach fbcon as in step 5. - - c. Attach fbcon - - vbetool vbestate restore < && \ - echo 1 > /sys/class/vtconsole/vtcon1/bind - -Variation 2: - - a. Before detaching fbcon, do: - echo > /sys/class/tty/console/bind - - - vbetool vbemode get - - b. Take note of the mode number - - b. Detach fbcon as in step 5. - - c. 
Attach fbcon: - - vbetool vbemode set && \ - echo 1 > /sys/class/vtconsole/vtcon1/bind - -Samples: -======== - -Here are 2 sample bash scripts that you can use to bind or unbind the -framebuffer console driver if you are on an X86 box: - ---------------------------------------------------------------------------- -#!/bin/bash -# Unbind fbcon - -# Change this to where your actual vgastate file is located -# Or Use VGASTATE=$1 to indicate the state file at runtime -VGASTATE=/tmp/vgastate - -# path to vbetool -VBETOOL=/usr/local/bin - - -for (( i = 0; i < 16; i++)) -do - if test -x /sys/class/vtconsole/vtcon$i; then - if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ - = 1 ]; then - if test -x $VBETOOL/vbetool; then - echo Unbinding vtcon$i - $VBETOOL/vbetool vbestate restore < $VGASTATE - echo 0 > /sys/class/vtconsole/vtcon$i/bind - fi - fi - fi -done - ---------------------------------------------------------------------------- -#!/bin/bash -# Bind fbcon - -for (( i = 0; i < 16; i++)) -do - if test -x /sys/class/vtconsole/vtcon$i; then - if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ - = 1 ]; then - echo Unbinding vtcon$i - echo 1 > /sys/class/vtconsole/vtcon$i/bind - fi - fi -done ---------------------------------------------------------------------------- - --- -Antonino Daplas diff --git a/Documentation/fb/framebuffer.rst b/Documentation/fb/framebuffer.rst new file mode 100644 index 000000000000..7fe087310c82 --- /dev/null +++ b/Documentation/fb/framebuffer.rst @@ -0,0 +1,353 @@ +======================= +The Frame Buffer Device +======================= + +Last revised: May 10, 2001 + + +0. Introduction +--------------- + +The frame buffer device provides an abstraction for the graphics hardware. 
It +represents the frame buffer of some video hardware and allows application +software to access the graphics hardware through a well-defined interface, so +the software doesn't need to know anything about the low-level (hardware +register) stuff. + +The device is accessed through special device nodes, usually located in the +/dev directory, i.e. /dev/fb*. + + +1. User's View of /dev/fb* +-------------------------- + +From the user's point of view, the frame buffer device looks just like any +other device in /dev. It's a character device using major 29; the minor +specifies the frame buffer number. + +By convention, the following device nodes are used (numbers indicate the device +minor numbers):: + + 0 = /dev/fb0 First frame buffer + 1 = /dev/fb1 Second frame buffer + ... + 31 = /dev/fb31 32nd frame buffer + +For backwards compatibility, you may want to create the following symbolic +links:: + + /dev/fb0current -> fb0 + /dev/fb1current -> fb1 + +and so on... + +The frame buffer devices are also `normal` memory devices, this means, you can +read and write their contents. You can, for example, make a screen snapshot by:: + + cp /dev/fb0 myfile + +There also can be more than one frame buffer at a time, e.g. if you have a +graphics card in addition to the built-in hardware. The corresponding frame +buffer devices (/dev/fb0 and /dev/fb1 etc.) work independently. + +Application software that uses the frame buffer device (e.g. the X server) will +use /dev/fb0 by default (older software uses /dev/fb0current). You can specify +an alternative frame buffer device by setting the environment variable +$FRAMEBUFFER to the path name of a frame buffer device, e.g. (for sh/bash +users):: + + export FRAMEBUFFER=/dev/fb1 + +or (for csh users):: + + setenv FRAMEBUFFER /dev/fb1 + +After this the X server will use the second frame buffer. + + +2. 
Programmer's View of /dev/fb* +-------------------------------- + +As you already know, a frame buffer device is a memory device like /dev/mem and +it has the same features. You can read it, write it, seek to some location in +it and mmap() it (the main usage). The difference is just that the memory that +appears in the special file is not the whole memory, but the frame buffer of +some video hardware. + +/dev/fb* also allows several ioctls on it, by which lots of information about +the hardware can be queried and set. The color map handling works via ioctls, +too. Look into for more information on what ioctls exist and on +which data structures they work. Here's just a brief overview: + + - You can request unchangeable information about the hardware, like name, + organization of the screen memory (planes, packed pixels, ...) and address + and length of the screen memory. + + - You can request and change variable information about the hardware, like + visible and virtual geometry, depth, color map format, timing, and so on. + If you try to change that information, the driver maybe will round up some + values to meet the hardware's capabilities (or return EINVAL if that isn't + possible). + + - You can get and set parts of the color map. Communication is done with 16 + bits per color part (red, green, blue, transparency) to support all + existing hardware. The driver does all the computations needed to apply + it to the hardware (round it down to less bits, maybe throw away + transparency). + +All this hardware abstraction makes the implementation of application programs +easier and more portable. E.g. the X server works completely on /dev/fb* and +thus doesn't need to know, for example, how the color registers of the concrete +hardware are organized. XF68_FBDev is a general X server for bitmapped, +unaccelerated video hardware. 
The only thing that has to be built into +application programs is the screen organization (bitplanes or chunky pixels +etc.), because it works on the frame buffer image data directly. + +For the future it is planned that frame buffer drivers for graphics cards and +the like can be implemented as kernel modules that are loaded at runtime. Such +a driver just has to call register_framebuffer() and supply some functions. +Writing and distributing such drivers independently from the kernel will save +much trouble... + + +3. Frame Buffer Resolution Maintenance +-------------------------------------- + +Frame buffer resolutions are maintained using the utility `fbset`. It can +change the video mode properties of a frame buffer device. Its main usage is +to change the current video mode, e.g. during boot up in one of your `/etc/rc.*` +or `/etc/init.d/*` files. + +Fbset uses a video mode database stored in a configuration file, so you can +easily add your own modes and refer to them with a simple identifier. + + +4. The X Server +--------------- + +The X server (XF68_FBDev) is the most notable application program for the frame +buffer device. Starting with XFree86 release 3.2, the X server is part of +XFree86 and has 2 modes: + + - If the `Display` subsection for the `fbdev` driver in the /etc/XF86Config + file contains a:: + + Modes "default" + + line, the X server will use the scheme discussed above, i.e. it will start + up in the resolution determined by /dev/fb0 (or $FRAMEBUFFER, if set). You + still have to specify the color depth (using the Depth keyword) and virtual + resolution (using the Virtual keyword) though. This is the default for the + configuration file supplied with XFree86. It's the most simple + configuration, but it has some limitations. + + - Therefore it's also possible to specify resolutions in the /etc/XF86Config + file. This allows for on-the-fly resolution switching while retaining the + same virtual desktop size. 
The frame buffer device that's used is still + /dev/fb0current (or $FRAMEBUFFER), but the available resolutions are + defined by /etc/XF86Config now. The disadvantage is that you have to + specify the timings in a different format (but `fbset -x` may help). + +To tune a video mode, you can use fbset or xvidtune. Note that xvidtune doesn't +work 100% with XF68_FBDev: the reported clock values are always incorrect. + + +5. Video Mode Timings +--------------------- + +A monitor draws an image on the screen by using an electron beam (3 electron +beams for color models, 1 electron beam for monochrome monitors). The front of +the screen is covered by a pattern of colored phosphors (pixels). If a phosphor +is hit by an electron, it emits a photon and thus becomes visible. + +The electron beam draws horizontal lines (scanlines) from left to right, and +from the top to the bottom of the screen. By modifying the intensity of the +electron beam, pixels with various colors and intensities can be shown. + +After each scanline the electron beam has to move back to the left side of the +screen and to the next line: this is called the horizontal retrace. After the +whole screen (frame) was painted, the beam moves back to the upper left corner: +this is called the vertical retrace. During both the horizontal and vertical +retrace, the electron beam is turned off (blanked). + +The speed at which the electron beam paints the pixels is determined by the +dotclock in the graphics board. For a dotclock of e.g. 28.37516 MHz (millions +of cycles per second), each pixel is 35242 ps (picoseconds) long:: + + 1/(28.37516E6 Hz) = 35.242E-9 s + +If the screen resolution is 640x480, it will take:: + + 640*35.242E-9 s = 22.555E-6 s + +to paint the 640 (xres) pixels on one scanline. But the horizontal retrace +also takes time (e.g. 
272 `pixels`), so a full scanline takes:: + + (640+272)*35.242E-9 s = 32.141E-6 s + +We'll say that the horizontal scanrate is about 31 kHz:: + + 1/(32.141E-6 s) = 31.113E3 Hz + +A full screen counts 480 (yres) lines, but we have to consider the vertical +retrace too (e.g. 49 `lines`). So a full screen will take:: + + (480+49)*32.141E-6 s = 17.002E-3 s + +The vertical scanrate is about 59 Hz:: + + 1/(17.002E-3 s) = 58.815 Hz + +This means the screen data is refreshed about 59 times per second. To have a +stable picture without visible flicker, VESA recommends a vertical scanrate of +at least 72 Hz. But the perceived flicker is very human dependent: some people +can use 50 Hz without any trouble, while I'll notice if it's less than 80 Hz. + +Since the monitor doesn't know when a new scanline starts, the graphics board +will supply a synchronization pulse (horizontal sync or hsync) for each +scanline. Similarly it supplies a synchronization pulse (vertical sync or +vsync) for each new frame. The position of the image on the screen is +influenced by the moments at which the synchronization pulses occur. + +The following picture summarizes all timings. 
The horizontal retrace time is +the sum of the left margin, the right margin and the hsync length, while the +vertical retrace time is the sum of the upper margin, the lower margin and the +vsync length:: + + +----------+---------------------------------------------+----------+-------+ + | | ↑ | | | + | | |upper_margin | | | + | | ↓ | | | + +----------###############################################----------+-------+ + | # ↑ # | | + | # | # | | + | # | # | | + | # | # | | + | left # | # right | hsync | + | margin # | xres # margin | len | + |<-------->#<---------------+--------------------------->#<-------->|<----->| + | # | # | | + | # | # | | + | # | # | | + | # |yres # | | + | # | # | | + | # | # | | + | # | # | | + | # | # | | + | # | # | | + | # | # | | + | # | # | | + | # | # | | + | # ↓ # | | + +----------###############################################----------+-------+ + | | ↑ | | | + | | |lower_margin | | | + | | ↓ | | | + +----------+---------------------------------------------+----------+-------+ + | | ↑ | | | + | | |vsync_len | | | + | | ↓ | | | + +----------+---------------------------------------------+----------+-------+ + +The frame buffer device expects all horizontal timings in number of dotclocks +(in picoseconds, 1E-12 s), and vertical timings in number of scanlines. + + +6. 
Converting XFree86 timing values into frame buffer device timings +-------------------------------------------------------------------- + +An XFree86 mode line consists of the following fields:: + + "800x600" 50 800 856 976 1040 600 637 643 666 + < name > DCF HR SH1 SH2 HFL VR SV1 SV2 VFL + +The frame buffer device uses the following fields: + + - pixclock: pixel clock in ps (picoseconds) + - left_margin: time from sync to picture + - right_margin: time from picture to sync + - upper_margin: time from sync to picture + - lower_margin: time from picture to sync + - hsync_len: length of horizontal sync + - vsync_len: length of vertical sync + +1) Pixelclock: + + xfree: in MHz + + fb: in picoseconds (ps) + + pixclock = 1000000 / DCF + +2) horizontal timings: + + left_margin = HFL - SH2 + + right_margin = SH1 - HR + + hsync_len = SH2 - SH1 + +3) vertical timings: + + upper_margin = VFL - SV2 + + lower_margin = SV1 - VR + + vsync_len = SV2 - SV1 + +Good examples for VESA timings can be found in the XFree86 source tree, +under "xc/programs/Xserver/hw/xfree86/doc/modeDB.txt". + + +7. References +------------- + +For more specific information about the frame buffer device and its +applications, please refer to the Linux-fbdev website: + + http://linux-fbdev.sourceforge.net/ + +and to the following documentation: + + - The manual pages for fbset: fbset(8), fb.modes(5) + - The manual pages for XFree86: XF68_FBDev(1), XF86Config(4/5) + - The mighty kernel sources: + + - linux/drivers/video/ + - linux/include/linux/fb.h + - linux/include/video/ + + + +8. Mailing list +--------------- + +There is a frame buffer device related mailing list at kernel.org: +linux-fbdev@vger.kernel.org. + +Point your web browser to http://sourceforge.net/projects/linux-fbdev/ for +subscription information and archive browsing. + + +9. Downloading +-------------- + +All necessary files can be found at + + ftp://ftp.uni-erlangen.de/pub/Linux/LOCAL/680x0/ + +and on its mirrors.
+ +The latest version of fbset can be found at + + http://www.linux-fbdev.org/ + + +10. Credits +----------- + +This readme was written by Geert Uytterhoeven, partly based on the original +`X-framebuffer.README` by Roman Hodek and Martin Schaller. Section 6 was +provided by Frank Neumann. + +The frame buffer device abstraction was designed by Martin Schaller. diff --git a/Documentation/fb/framebuffer.txt b/Documentation/fb/framebuffer.txt deleted file mode 100644 index 58c5ae2e9f59..000000000000 --- a/Documentation/fb/framebuffer.txt +++ /dev/null @@ -1,343 +0,0 @@ - The Frame Buffer Device - ----------------------- - -Maintained by Geert Uytterhoeven -Last revised: May 10, 2001 - - -0. Introduction ---------------- - -The frame buffer device provides an abstraction for the graphics hardware. It -represents the frame buffer of some video hardware and allows application -software to access the graphics hardware through a well-defined interface, so -the software doesn't need to know anything about the low-level (hardware -register) stuff. - -The device is accessed through special device nodes, usually located in the -/dev directory, i.e. /dev/fb*. - - -1. User's View of /dev/fb* --------------------------- - -From the user's point of view, the frame buffer device looks just like any -other device in /dev. It's a character device using major 29; the minor -specifies the frame buffer number. - -By convention, the following device nodes are used (numbers indicate the device -minor numbers): - - 0 = /dev/fb0 First frame buffer - 1 = /dev/fb1 Second frame buffer - ... - 31 = /dev/fb31 32nd frame buffer - -For backwards compatibility, you may want to create the following symbolic -links: - - /dev/fb0current -> fb0 - /dev/fb1current -> fb1 - -and so on... - -The frame buffer devices are also `normal' memory devices, this means, you can -read and write their contents. 
You can, for example, make a screen snapshot by - - cp /dev/fb0 myfile - -There also can be more than one frame buffer at a time, e.g. if you have a -graphics card in addition to the built-in hardware. The corresponding frame -buffer devices (/dev/fb0 and /dev/fb1 etc.) work independently. - -Application software that uses the frame buffer device (e.g. the X server) will -use /dev/fb0 by default (older software uses /dev/fb0current). You can specify -an alternative frame buffer device by setting the environment variable -$FRAMEBUFFER to the path name of a frame buffer device, e.g. (for sh/bash -users): - - export FRAMEBUFFER=/dev/fb1 - -or (for csh users): - - setenv FRAMEBUFFER /dev/fb1 - -After this the X server will use the second frame buffer. - - -2. Programmer's View of /dev/fb* --------------------------------- - -As you already know, a frame buffer device is a memory device like /dev/mem and -it has the same features. You can read it, write it, seek to some location in -it and mmap() it (the main usage). The difference is just that the memory that -appears in the special file is not the whole memory, but the frame buffer of -some video hardware. - -/dev/fb* also allows several ioctls on it, by which lots of information about -the hardware can be queried and set. The color map handling works via ioctls, -too. Look into for more information on what ioctls exist and on -which data structures they work. Here's just a brief overview: - - - You can request unchangeable information about the hardware, like name, - organization of the screen memory (planes, packed pixels, ...) and address - and length of the screen memory. - - - You can request and change variable information about the hardware, like - visible and virtual geometry, depth, color map format, timing, and so on. - If you try to change that information, the driver maybe will round up some - values to meet the hardware's capabilities (or return EINVAL if that isn't - possible). 
- - - You can get and set parts of the color map. Communication is done with 16 - bits per color part (red, green, blue, transparency) to support all - existing hardware. The driver does all the computations needed to apply - it to the hardware (round it down to less bits, maybe throw away - transparency). - -All this hardware abstraction makes the implementation of application programs -easier and more portable. E.g. the X server works completely on /dev/fb* and -thus doesn't need to know, for example, how the color registers of the concrete -hardware are organized. XF68_FBDev is a general X server for bitmapped, -unaccelerated video hardware. The only thing that has to be built into -application programs is the screen organization (bitplanes or chunky pixels -etc.), because it works on the frame buffer image data directly. - -For the future it is planned that frame buffer drivers for graphics cards and -the like can be implemented as kernel modules that are loaded at runtime. Such -a driver just has to call register_framebuffer() and supply some functions. -Writing and distributing such drivers independently from the kernel will save -much trouble... - - -3. Frame Buffer Resolution Maintenance --------------------------------------- - -Frame buffer resolutions are maintained using the utility `fbset'. It can -change the video mode properties of a frame buffer device. Its main usage is -to change the current video mode, e.g. during boot up in one of your /etc/rc.* -or /etc/init.d/* files. - -Fbset uses a video mode database stored in a configuration file, so you can -easily add your own modes and refer to them with a simple identifier. - - -4. The X Server ---------------- - -The X server (XF68_FBDev) is the most notable application program for the frame -buffer device. 
Starting with XFree86 release 3.2, the X server is part of -XFree86 and has 2 modes: - - - If the `Display' subsection for the `fbdev' driver in the /etc/XF86Config - file contains a - - Modes "default" - - line, the X server will use the scheme discussed above, i.e. it will start - up in the resolution determined by /dev/fb0 (or $FRAMEBUFFER, if set). You - still have to specify the color depth (using the Depth keyword) and virtual - resolution (using the Virtual keyword) though. This is the default for the - configuration file supplied with XFree86. It's the most simple - configuration, but it has some limitations. - - - Therefore it's also possible to specify resolutions in the /etc/XF86Config - file. This allows for on-the-fly resolution switching while retaining the - same virtual desktop size. The frame buffer device that's used is still - /dev/fb0current (or $FRAMEBUFFER), but the available resolutions are - defined by /etc/XF86Config now. The disadvantage is that you have to - specify the timings in a different format (but `fbset -x' may help). - -To tune a video mode, you can use fbset or xvidtune. Note that xvidtune doesn't -work 100% with XF68_FBDev: the reported clock values are always incorrect. - - -5. Video Mode Timings ---------------------- - -A monitor draws an image on the screen by using an electron beam (3 electron -beams for color models, 1 electron beam for monochrome monitors). The front of -the screen is covered by a pattern of colored phosphors (pixels). If a phosphor -is hit by an electron, it emits a photon and thus becomes visible. - -The electron beam draws horizontal lines (scanlines) from left to right, and -from the top to the bottom of the screen. By modifying the intensity of the -electron beam, pixels with various colors and intensities can be shown. - -After each scanline the electron beam has to move back to the left side of the -screen and to the next line: this is called the horizontal retrace. 
After the -whole screen (frame) was painted, the beam moves back to the upper left corner: -this is called the vertical retrace. During both the horizontal and vertical -retrace, the electron beam is turned off (blanked). - -The speed at which the electron beam paints the pixels is determined by the -dotclock in the graphics board. For a dotclock of e.g. 28.37516 MHz (millions -of cycles per second), each pixel is 35242 ps (picoseconds) long: - - 1/(28.37516E6 Hz) = 35.242E-9 s - -If the screen resolution is 640x480, it will take - - 640*35.242E-9 s = 22.555E-6 s - -to paint the 640 (xres) pixels on one scanline. But the horizontal retrace -also takes time (e.g. 272 `pixels'), so a full scanline takes - - (640+272)*35.242E-9 s = 32.141E-6 s - -We'll say that the horizontal scanrate is about 31 kHz: - - 1/(32.141E-6 s) = 31.113E3 Hz - -A full screen counts 480 (yres) lines, but we have to consider the vertical -retrace too (e.g. 49 `lines'). So a full screen will take - - (480+49)*32.141E-6 s = 17.002E-3 s - -The vertical scanrate is about 59 Hz: - - 1/(17.002E-3 s) = 58.815 Hz - -This means the screen data is refreshed about 59 times per second. To have a -stable picture without visible flicker, VESA recommends a vertical scanrate of -at least 72 Hz. But the perceived flicker is very human dependent: some people -can use 50 Hz without any trouble, while I'll notice if it's less than 80 Hz. - -Since the monitor doesn't know when a new scanline starts, the graphics board -will supply a synchronization pulse (horizontal sync or hsync) for each -scanline. Similarly it supplies a synchronization pulse (vertical sync or -vsync) for each new frame. The position of the image on the screen is -influenced by the moments at which the synchronization pulses occur. - -The following picture summarizes all timings. 
The horizontal retrace time is -the sum of the left margin, the right margin and the hsync length, while the -vertical retrace time is the sum of the upper margin, the lower margin and the -vsync length. - - +----------+---------------------------------------------+----------+-------+ - | | ↑ | | | - | | |upper_margin | | | - | | ↓ | | | - +----------###############################################----------+-------+ - | # ↑ # | | - | # | # | | - | # | # | | - | # | # | | - | left # | # right | hsync | - | margin # | xres # margin | len | - |<-------->#<---------------+--------------------------->#<-------->|<----->| - | # | # | | - | # | # | | - | # | # | | - | # |yres # | | - | # | # | | - | # | # | | - | # | # | | - | # | # | | - | # | # | | - | # | # | | - | # | # | | - | # | # | | - | # ↓ # | | - +----------###############################################----------+-------+ - | | ↑ | | | - | | |lower_margin | | | - | | ↓ | | | - +----------+---------------------------------------------+----------+-------+ - | | ↑ | | | - | | |vsync_len | | | - | | ↓ | | | - +----------+---------------------------------------------+----------+-------+ - -The frame buffer device expects all horizontal timings in number of dotclocks -(in picoseconds, 1E-12 s), and vertical timings in number of scanlines. - - -6. 
Converting XFree86 timing values info frame buffer device timings --------------------------------------------------------------------- - -An XFree86 mode line consists of the following fields: - "800x600" 50 800 856 976 1040 600 637 643 666 - < name > DCF HR SH1 SH2 HFL VR SV1 SV2 VFL - -The frame buffer device uses the following fields: - - - pixclock: pixel clock in ps (pico seconds) - - left_margin: time from sync to picture - - right_margin: time from picture to sync - - upper_margin: time from sync to picture - - lower_margin: time from picture to sync - - hsync_len: length of horizontal sync - - vsync_len: length of vertical sync - -1) Pixelclock: - xfree: in MHz - fb: in picoseconds (ps) - - pixclock = 1000000 / DCF - -2) horizontal timings: - left_margin = HFL - SH2 - right_margin = SH1 - HR - hsync_len = SH2 - SH1 - -3) vertical timings: - upper_margin = VFL - SV2 - lower_margin = SV1 - VR - vsync_len = SV2 - SV1 - -Good examples for VESA timings can be found in the XFree86 source tree, -under "xc/programs/Xserver/hw/xfree86/doc/modeDB.txt". - - -7. References -------------- - -For more specific information about the frame buffer device and its -applications, please refer to the Linux-fbdev website: - - http://linux-fbdev.sourceforge.net/ - -and to the following documentation: - - - The manual pages for fbset: fbset(8), fb.modes(5) - - The manual pages for XFree86: XF68_FBDev(1), XF86Config(4/5) - - The mighty kernel sources: - o linux/drivers/video/ - o linux/include/linux/fb.h - o linux/include/video/ - - - -8. Mailing list ---------------- - -There is a frame buffer device related mailing list at kernel.org: -linux-fbdev@vger.kernel.org. - -Point your web browser to http://sourceforge.net/projects/linux-fbdev/ for -subscription information and archive browsing. - - -9. Downloading --------------- - -All necessary files can be found at - - ftp://ftp.uni-erlangen.de/pub/Linux/LOCAL/680x0/ - -and on its mirrors. 
- -The latest version of fbset can be found at - - http://www.linux-fbdev.org/ - - -10. Credits ----------- - -This readme was written by Geert Uytterhoeven, partly based on the original -`X-framebuffer.README' by Roman Hodek and Martin Schaller. Section 6 was -provided by Frank Neumann. - -The frame buffer device abstraction was designed by Martin Schaller. diff --git a/Documentation/fb/gxfb.rst b/Documentation/fb/gxfb.rst new file mode 100644 index 000000000000..5738709bccbb --- /dev/null +++ b/Documentation/fb/gxfb.rst @@ -0,0 +1,54 @@ +============= +What is gxfb? +============= + +.. [This file is cloned from VesaFB/aty128fb] + +This is a graphics framebuffer driver for AMD Geode GX2 based processors. + +Advantages: + + * No need to use AMD's VSA code (or other VESA emulation layer) in the + BIOS. + * It provides a nice large console (128 cols + 48 lines with 1024x768) + without using tiny, unreadable fonts. + * You can run XF68_FBDev on top of /dev/fb0 + * Most important: boot logo :-) + +Disadvantages: + + * graphic mode is slower than text mode... + + +How to use it? +============== + +Switching modes is done using gxfb.mode_option=... boot +parameter or using `fbset` program. + +See Documentation/fb/modedb.rst for more information on modedb +resolutions. + + +X11 +=== + +XF68_FBDev should generally work fine, but it is non-accelerated. + + +Configuration +============= + +You can pass kernel command line options to gxfb with gxfb.