Chapter 4 Disks and Filesystems(第 4 章 磁盘和文件系统)

In Chapter 3, we discussed some of the top-level disk devices that the kernel makes available. In this chapter, we’ll discuss in detail how to work with disks on a Linux system. You’ll learn how to partition disks, create and maintain the filesystems that go inside disk partitions, and work with swap space.

在第三章中,我们讨论了内核提供的一些顶层磁盘设备。

在本章中,我们将详细讨论如何在Linux系统中使用磁盘。

您将学习如何分区磁盘,创建和维护磁盘分区内的文件系统,并处理交换空间。

Recall that disk devices have names like /dev/sda, the first SCSI subsystem disk. This kind of block device represents the entire disk, but there are many different components and layers inside a disk.

请记住,磁盘设备的名称类似于/dev/sda,表示第一个SCSI子系统磁盘。

这种块设备表示整个磁盘,但磁盘内部有许多不同的组件和层次。

Figure 4-1 illustrates the schematic of a typical Linux disk (note that the figure is not to scale). As you progress through this chapter, you’ll learn where each piece fits in.

图4-1展示了典型Linux磁盘的示意图(请注意,图中的比例不准确)。

随着您阅读本章,您将了解每个组件的位置。

Figure 4-1. Typical Linux disk schematic

图 4-1. 典型的 Linux 磁盘原理图

Partitions are subdivisions of the whole disk. On Linux, they’re denoted with a number after the whole block device, and therefore have device names such as /dev/sda1 and /dev/sdb3. The kernel presents each partition as a block device, just as it would an entire disk. Partitions are defined on a small area of the disk called a partition table.

分区是整个磁盘的子分区。

在Linux上,它们以整个块设备后面的数字表示,因此具有设备名称,如/dev/sda1和/dev/sdb3。

内核将每个分区呈现为块设备,就像它对待整个磁盘一样。

分区定义在一个称为分区表的小区域上。

NOTE
Multiple data partitions were once common on systems with large disks because older PCs could boot only from certain parts of the disk. Also, administrators used partitions to reserve a certain amount of space for operating system areas; for example, they didn’t want users to be able to fill up the entire system and prevent critical services from working. This practice is not unique to Unix; you’ll still find many new Windows systems with several partitions on a single disk. In addition, most systems have a separate swap partition.

注意
在拥有大容量硬盘的系统上,多个数据分区曾经很常见,因为旧的个人电脑只能从硬盘的某些部分启动。

此外,管理员使用分区来保留一定的空间给操作系统区域;例如,他们不希望用户能够填满整个系统并阻止关键服务的运行。

这种做法并不仅限于Unix;你仍然会在许多新的Windows系统上找到一个硬盘上有几个分区的情况。此外,大多数系统都有一个单独的交换分区。

Although the kernel makes it possible for you to access both an entire disk and one of its partitions at the same time, you would not normally do so, unless you were copying the entire disk.

尽管内核使您能够同时访问整个磁盘和其分区,但通常情况下,除非您要复制整个磁盘,否则不会这样做。

The next layer after the partition is the filesystem, the database of files and directories that you’re accustomed to interacting with in user space. We’ll explore filesystems in 4.2 Filesystems.

分区之后的下一层是文件系统,即您在用户空间中习惯与之交互的文件和目录的数据库。

我们将在4.2文件系统中探讨文件系统。

As you can see in Figure 4-1, if you want to access the data in a file, you need to get the appropriate partition location from the partition table and then search the filesystem database on that partition for the desired file data.

如图4-1所示,如果您想访问文件中的数据,您需要从分区表中获取相应的分区位置,然后在该分区的文件系统数据库中搜索所需的文件数据。

To access data on a disk, the Linux kernel uses the system of layers shown in Figure 4-2. The SCSI subsystem and everything else described in 3.6 In-Depth: SCSI and the Linux Kernel are represented by a single box. (Notice that you can work with the disk through the filesystem as well as directly through the disk devices. You’ll do both in this chapter.)

为了访问磁盘上的数据,Linux内核使用了图4-2所示的层次系统。

SCSI子系统和3.6深入解析:SCSI和Linux内核中描述的其他内容都由一个单独的框表示。

(请注意,您可以通过文件系统以及直接通过磁盘设备来处理磁盘。

在本章中,您将两者都会遇到。)

To get a handle on how everything fits together, let’s start at the bottom with partitions.

为了全面了解各个部分之间的关系,让我们从底层的分区开始。

Figure 4-2. Kernel schematic for disk access

图 4-2. 磁盘访问内核示意图

4.1 Partitioning Disk Devices(对磁盘设备进行分区)

There are many kinds of partition tables. The traditional table is the one found inside the Master Boot Record (MBR). A newer standard starting to gain traction is the Globally Unique Identifier Partition Table (GPT).

分区表有很多种。传统的分区表是主引导记录 (MBR) 中的分区表。

全局唯一标识符分区表(GPT)是一种较新的标准,已开始受到重视。

Here is an overview of the many Linux partitioning tools available:

下面概述了许多可用的 Linux 分区工具:

o parted A text-based tool that supports both MBR and GPT.
o gparted A graphical version of parted.
o fdisk The traditional text-based Linux disk partitioning tool. fdisk does not support GPT.
o gdisk A version of fdisk that supports GPT but not MBR.

Because it supports both MBR and GPT, we’ll use parted in this book. However, many people prefer the fdisk interface, and there’s nothing wrong with that.

o parted 基于文本的工具,支持 MBR 和 GPT。
o gparted 图形版本的 parted。
o fdisk 传统的基于文本的 Linux 磁盘分区工具。
o gdisk 支持 GPT 但不支持 MBR 的 fdisk 版本。

由于它同时支持 MBR 和 GPT,我们在本书中将使用 parted。不过,很多人更喜欢 fdisk 界面,这也无可厚非。

NOTE

Although parted can create and resize filesystems, you shouldn’t use it for filesystem manipulation because you can easily get confused. There is a critical difference between partitioning and filesystem manipulation. The partition table defines simple boundaries on the disk, whereas a filesystem is a much more involved data system. For this reason, we’ll use parted for partitioning but use separate utilities for creating filesystems (see 4.2.2 Creating a Filesystem). Even the parted documentation encourages you to create filesystems separately.

注意
尽管parted可以创建和调整文件系统,但你不应该使用它进行文件系统操作,因为容易混淆。分区和文件系统操作之间存在重要的区别。

分区表定义了磁盘上的简单边界,而文件系统是一个更复杂的数据系统。

因此,我们将使用parted进行分区,但使用单独的工具来创建文件系统(参见4.2.2创建文件系统)。

即使parted的文档也鼓励你单独创建文件系统。

4.1.1 Viewing a Partition Table(查看分区表)

You can view your system’s partition table with parted -l. Here is sample output from two disk devices with two different kinds of partition tables:

您可以使用parted -l命令查看系统的分区表。

下面是两个不同类型分区表的两个磁盘设备的示例输出:

# parted -l
Model: ATA WDC WD3200AAJS-2 (scsi)
Disk /dev/sda: 320GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End    Size    Type      File system     Flags
 1      1049kB  316GB  316GB   primary   ext4            boot
 2      316GB   320GB  4235MB  extended
 5      316GB   320GB  4235MB  logical   linux-swap(v1)

Model: FLASH Drive UT_USB20 (scsi)
Disk /dev/sdf: 4041MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name      Flags
 1      17.4kB  1000MB  1000MB               myfirst
 2      1000MB  4040MB  3040MB               mysecond

The first device, /dev/sda, uses the traditional MBR partition table (called “msdos” by parted), and the second contains a GPT table. Notice that there are different parameters for each partition table, because the tables themselves are different. In particular, there is no Name column for the MBR table because names don’t exist under that scheme. (I arbitrarily chose the names myfirst and mysecond in the GPT table.)

第一个设备 /dev/sda 使用传统的 MBR 分区表(parted 称为 "msdos"),第二个设备包含 GPT 表。

请注意,每个分区表都有不同的参数,因为表本身是不同的。

特别是,MBR 表没有 "名称 "列,因为在该方案下不存在名称。

(我在 GPT 表中随意选择了 myfirst 和 mysecond 这两个名称)。

The MBR table in this example contains primary, extended, and logical partitions. A primary partition is a normal subdivision of the disk; partition 1 is such a partition. The basic MBR has a limit of four primary partitions, so if you want more than four, you designate one partition as an extended partition. Next, you subdivide the extended partition into logical partitions that the operating system can use as it would any other partition. In this example, partition 2 is an extended partition that contains logical partition 5.

本例中的 MBR 表包含主分区、扩展分区和逻辑分区。

主分区是磁盘的一个普通分区;分区 1 就是这样一个分区。

基本 MBR 有四个主分区的限制,所以如果你想要超过四个分区,就需要指定一个分区为扩展分区。

然后,将扩展分区细分为逻辑分区,操作系统可以像使用其他分区一样使用这些逻辑分区。

在本例中,分区 2 是一个包含逻辑分区 5 的扩展分区。

NOTE The filesystem that parted lists is not necessarily the system ID field defined in most MBR entries. The MBR system ID is just a number; for example, 83 is a Linux partition and 82 is Linux swap. Therefore, parted attempts to determine a filesystem on its own. If you absolutely must know the system ID for an MBR, use fdisk -l.

注意

parted 列出的文件系统不一定是大多数 MBR 条目中定义的系统 ID 字段。

MBR 系统 ID 只是一个数字;例如,83 是 Linux 分区,82 是 Linux swap。

因此,parted 会尝试自行确定文件系统。如果必须知道 MBR 的系统 ID,请使用 fdisk -l。

Initial Kernel Read(初始内核读取)

When initially reading the MBR table, the Linux kernel produces the following debugging output (remember that you can view this with dmesg):

最初读取 MBR 表时,Linux 内核会产生以下调试输出(请记住,您可以使用 dmesg 查看):

sda: sda1 sda2 < sda5 >

The sda2 < sda5 > output indicates that /dev/sda2 is an extended partition containing one logical partition, /dev/sda5. You’ll normally ignore extended partitions because you’ll typically want to access only the logical partitions inside.

sda2 < sda5 > 输出显示 /dev/sda2 是一个扩展分区,包含一个逻辑分区 /dev/sda5。

你通常会忽略扩展分区,因为你通常只想访问里面的逻辑分区。
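As a rough illustration of how to read that kernel line, the following shell sketch prints the names that appear between < and >, that is, the logical partitions. The function name logical_parts is made up for this example, and the sketch assumes the whitespace-separated layout shown above.

```shell
# Hypothetical helper: print the logical partitions named in a kernel
# partition line such as "sda: sda1 sda2 < sda5 >".
# $1: the line to examine
logical_parts() {
    printf '%s\n' "$1" | awk '{
        inside = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "<") { inside = 1; continue }   # start of an extended partition
            if ($i == ">") { inside = 0; continue }   # end of an extended partition
            if (inside) print $i                      # a logical partition name
        }
    }'
}

logical_parts 'sda: sda1 sda2 < sda5 >'
```

On a live system you could feed it lines taken from dmesg output instead of a literal string.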

4.1.2 Changing Partition Tables(更改分区表)

Viewing partition tables is a relatively simple and harmless operation. Altering partition tables is also relatively easy, but there are risks involved in making this kind of change to the disk. Keep the following in mind:

查看分区表是一项相对简单且无害的操作。

更改分区表也相对简单,但对磁盘进行此类更改存在风险。请牢记以下几点:

o Changing the partition table makes it quite difficult to recover any data on partitions that you delete because it changes the initial point of reference for a filesystem. Make sure that you have a backup if the disk you’re partitioning contains critical data.
o Ensure that no partitions on your target disk are currently in use. This is a concern because most Linux distributions automatically mount any detected filesystem. (See 4.2.3 Mounting a Filesystem for more on mounting and unmounting.)

o 更改分区表会导致很难恢复被删除分区上的任何数据,因为这会更改文件系统的初始参考点。如果要分区的磁盘包含重要数据,请确保有备份。
o 确保目标磁盘上没有当前正在使用的分区。这是一个问题,因为大多数 Linux 发行版都会自动挂载任何检测到的文件系统。(有关挂载和卸载的更多信息,请参阅 4.2.3 挂载文件系统。)
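One way to check the second point is to filter the output of the mount command for the target disk. The sketch below is only an illustration: mounted_parts_of is a hypothetical helper, it matches device names by simple prefix, and it does not detect other kinds of use such as active swap.

```shell
# Hypothetical helper: list partitions of a disk that show up as mounted.
# $1: disk name without /dev/ (for example, sdf)
# stdin: text in the format printed by the mount command
mounted_parts_of() {
    # Field 1 is the device and field 3 is the mount point.
    awk -v prefix="/dev/$1" 'index($1, prefix) == 1 { print $1 " on " $3 }'
}

# Example with canned input; on a real system use: mount | mounted_parts_of sdf
printf '%s\n' '/dev/sdf2 on /home/extra type ext4 (rw)' | mounted_parts_of sdf
```

If the helper prints nothing, no filesystem on that disk appears in the mount list.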

When you’re ready, choose your partitioning program. If you’d like to use parted, you can use the command-line parted utility or a graphical interface such as gparted; for an fdisk-style interface, use gdisk if you’re using GPT partitioning. These utilities all have online help and are easy to learn. (Try using them on a flash device or something similar if you don’t have any spare disks.)

准备就绪后,选择分区程序。如果想使用 parted,可以使用命令行 parted 工具或图形界面(如 gparted);如果想使用 fdisk 风格的界面,可以使用 gdisk(如果使用 GPT 分区)。

这些实用程序都有在线帮助,很容易上手。

(如果没有备用磁盘,可以尝试在闪存设备或类似设备上使用它们)。

That said, there is a major difference in the way that fdisk and parted work. With fdisk, you design your new partition table before making the actual changes to the disk; fdisk only makes the changes as you exit the program. But with parted, partitions are created, modified, and removed as you issue the commands. You don’t get the chance to review the partition table before you change it.

不过,fdisk 和 parted 的工作方式有很大不同。

使用 fdisk 时,在对磁盘进行实际更改之前,你需要设计新的分区表;fdisk 只会在你退出程序时进行更改。

但使用 parted 时,分区会在你发出命令时创建、修改和删除。在更改之前,你没有机会查看分区表。

These differences are also important to understanding how these two utilities interact with the kernel. Both fdisk and parted modify the partitions entirely in user space; there is no need to provide kernel support for rewriting a partition table because user space can read and modify all of a block device.

这些差异对于理解这两个工具如何与内核交互也很重要。

fdisk 和 parted 都完全在用户空间修改分区;由于用户空间可以读取和修改块设备的所有内容,因此不需要为重写分区表提供内核支持。

Eventually, though, the kernel must read the partition table in order to present the partitions as block devices. The fdisk utility uses a relatively simple method: After modifying the partition table, fdisk issues a single system call on the disk to tell the kernel that it should reread the partition table. The kernel then generates debugging output that you can view with dmesg. For example, if you create two partitions on /dev/sdf, you’ll see this:

但最终,内核必须读取分区表,以便将分区显示为块设备。

fdisk 实用程序使用一种相对简单的方法: 修改分区表后,fdisk 会在磁盘上发出一个系统调用,告诉内核应该重新读取分区表。

内核随后会生成调试输出,你可以用 dmesg 查看。

例如,如果你在 /dev/sdf 上创建了两个分区,你会看到下面的内容:

sdf: sdf1 sdf2

In comparison, the parted tools do not use this disk-wide system call. Instead, they signal the kernel when individual partitions are altered. After processing a single partition change, the kernel does not produce the preceding debugging output.

相比之下,parted工具不使用这种全盘的系统调用。

相反,它们在更改单个分区时向内核发出信号。在处理完单个分区更改后,内核不会产生前面的调试输出。

There are a few ways to see the partition changes:

有几种方法可以查看分区更改:

o Use udevadm to watch the kernel event changes. For example, udevadm monitor --kernel will show the old partition devices being removed and the new ones being added.
o Check /proc/partitions for full partition information.
o Check /sys/block/device/ for altered partition system interfaces or /dev for altered partition devices.

  • 使用udevadm监视内核事件的更改。例如,udevadm monitor --kernel将显示旧的分区设备被删除和新的分区设备被添加。
  • 检查/proc/partitions以获取完整的分区信息。
  • 检查/sys/block/device/以查看更改的分区系统接口或/dev以查看更改的分区设备。
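For the /proc/partitions check, a small filter can narrow the listing to one disk. This is a sketch under the assumption that the file has its usual layout (a header line, a blank line, then rows of major, minor, #blocks, and name); parts_of is a hypothetical name.

```shell
# Hypothetical helper: print the partitions of one disk.
# $1: disk name (for example, sdf)
# stdin: text in /proc/partitions format
parts_of() {
    # Skip the header and blank line, then keep names that start with
    # the disk name but are not the whole-disk entry itself.
    awk -v disk="$1" 'NR > 2 && index($4, disk) == 1 && $4 != disk { print $4 }'
}

# Typical use on a Linux system:
#   parts_of sdf < /proc/partitions
```

Running it before and after a partitioning command gives a quick before-and-after comparison.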

If you absolutely must be sure that you have modified a partition table, you can perform the old-style system call that fdisk uses by using the blockdev command. For example, to force the kernel to reload the partition table on /dev/sdf, run this:

如果您确实需要确保已修改分区表,可以使用blockdev命令执行fdisk使用的旧式系统调用。

例如,要强制内核重新加载/dev/sdf上的分区表,请运行以下命令:

# blockdev --rereadpt /dev/sdf

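At this point, you know everything you need to know about partitioning disks. However, if you’re interested in a few more details about disks, read on. Otherwise, skip ahead to 4.2 Filesystems to learn how to put a filesystem on the disk.

至此,您已经了解了有关分区磁盘的所有必要信息。

但是,如果您对磁盘的更多细节感兴趣,请继续阅读。否则,请跳到4.2文件系统,了解如何在磁盘上放置文件系统。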

4.1.3 Disk and Partition Geometry(磁盘和分区几何结构)

Any device with moving parts introduces complexity into a software system because there are physical elements that resist abstraction. A hard disk is no exception; even though you can think of a hard disk as a block device with random access to any block, there are serious performance consequences if you aren’t careful about how you lay out data on the disk. Consider the physical properties of the simple single-platter disk illustrated in Figure 4-3.

任何带有机械部件的设备都会在软件系统中引入复杂性,因为存在着无法抽象的物理元素。

硬盘也不例外;即使你可以将硬盘视为具有对任何块的随机访问的块设备,但如果你在硬盘上的数据布局上不小心,会导致严重的性能后果。

考虑图4-3所示的简单单盘硬盘的物理特性。

The disk consists of a spinning platter on a spindle, with a head attached to a moving arm that can sweep across the radius of the disk. As the disk spins underneath the head, the head reads data. When the arm is in one position, the head can only read data from a fixed circle. This circle is called a cylinder because larger disks have more than one platter, all stacked and spinning around the same spindle. Each platter can have one or two heads, for the top and/or bottom of the platter, and all heads are attached to the same arm and move in concert. Because the arm moves, there are many cylinders on the disk, from small ones around the center to large ones around the periphery of the disk. Finally, you can divide a cylinder into slices called sectors. This way of thinking about the disk geometry is called CHS, for cylinder-head-sector.

硬盘由一个固定在主轴上的旋转盘片和一个连接在移动臂上的磁头组成,该臂可以在盘片半径上扫过。当盘片在磁头下旋转时,磁头读取数据。

当臂处于某个位置时,磁头只能从一个固定的圆圈读取数据。

这个圆圈被称为柱面,因为较大的硬盘有多个盘片,都堆叠在同一个主轴上旋转。

每个盘片可以有一个或两个磁头,用于盘片的顶部和/或底部,所有磁头都连接在同一个臂上并协同移动。

由于臂的移动,硬盘上有许多柱面,从中心附近的小柱面到盘片周边的大柱面。

最后,你可以将柱面划分为称为扇区的片段。这种对硬盘几何结构的思考方式被称为柱面-磁头-扇区(CHS)。

Figure 4-3. Top-down view of a hard disk

NOTE A track is a part of a cylinder that a single head accesses, so in Figure 4-3, a cylinder is also a track. You probably don’t need to worry about tracks.

注意:一个磁道是一个磁头访问的柱面的一部分,所以在图4-3中,柱面也是一个磁道。

你可能不需要担心磁道。

The kernel and the various partitioning programs can tell you what a disk reports as its number of cylinders (and sectors, which are slices of cylinders). However, on a modern hard disk, the reported values are fiction! The traditional addressing scheme that uses CHS doesn’t scale with modern disk hardware, nor does it account for the fact that you can put more data into outer cylinders than inner cylinders. Disk hardware supports Logical Block Addressing (LBA) to simply address a location on the disk by a block number, but remnants of CHS remain. For example, the MBR partition table contains CHS information as well as LBA equivalents, and some boot loaders are still dumb enough to believe the CHS values (don’t worry: most Linux boot loaders use the LBA values).

内核和各种分区程序可以告诉你磁盘报告的柱面数(以及扇区,即柱面的切片)。

然而,在现代硬盘上,报告的值是虚构的!使用CHS的传统寻址方案无法与现代磁盘硬件相适应,也无法考虑到外部柱面可以容纳更多数据的事实。

磁盘硬件支持逻辑块寻址(LBA),通过块号简单地寻址磁盘上的位置,但CHS的遗留物仍然存在。

例如,MBR分区表包含CHS信息和LBA等效信息,一些引导加载程序仍然愚蠢到相信CHS的值(不用担心,大多数Linux引导加载程序使用LBA值)。
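To make the relationship between the two addressing schemes concrete, here is the conventional textbook mapping from a CHS triple to a block number, written as a shell sketch. The formula is the commonly cited one rather than something taken from this chapter, and the geometry values (heads per cylinder, sectors per track) are whatever fictional figures the disk reports; chs_to_lba is a hypothetical name.

```shell
# Hypothetical helper: convert a CHS address to an LBA block number.
# $1: cylinder (0-based)   $2: head (0-based)   $3: sector (1-based)
# $4: heads per cylinder   $5: sectors per track
chs_to_lba() {
    echo $(( ($1 * $4 + $2) * $5 + ($3 - 1) ))
}

# The very first sector of a disk, assuming a 255-head, 63-sector geometry:
chs_to_lba 0 0 1 255 63
```

Because sectors are numbered from 1 while cylinders and heads are numbered from 0, the first sector of the disk maps to block 0.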

Nevertheless, the idea of cylinders has been important to partitioning because cylinders are ideal boundaries for partitions. Reading a data stream from a cylinder is very fast because the head can continuously pick up data as the disk spins. A partition arranged as a set of adjacent cylinders also allows for fast continuous data access because the head doesn’t need to move very far between cylinders.

尽管如此,柱面的概念对于分区来说是重要的,因为柱面是分区的理想边界。

从柱面读取数据流非常快,因为磁头可以在磁盘旋转时连续获取数据。

作为一组相邻柱面排列的分区也允许快速连续数据访问,因为磁头在柱面之间移动的距离不需要很远。

Some partitioning programs complain if you don’t place your partitions precisely on cylinder boundaries. Ignore this; there’s little you can do because the reported CHS values of modern disks simply aren’t true. The disk’s LBA scheme ensures that your partitions are where they’re supposed to be.

一些分区程序会抱怨如果你没有将分区精确地放在柱面边界上。

忽略这一点;因为现代磁盘报告的CHS值根本不准确,你几乎无能为力。

磁盘的LBA方案确保你的分区在它们应该在的位置上。

4.1.4 Solid-State Disks (SSDs)(固态硬盘)

Storage devices with no moving parts, such as solid-state disks (SSDs), are radically different from spinning disks in terms of their access characteristics. For these, random access is not a problem because there’s no head to sweep across a platter, but certain factors affect performance.

没有机械部件的存储设备,例如固态硬盘(SSD),在访问特性上与旋转硬盘截然不同。

对于这些设备,随机访问并不是一个问题,因为没有磁头需要在盘片上扫过,但某些因素会影响性能。

One of the most significant factors affecting the performance of SSDs is partition alignment. When you read data from an SSD, you read it in chunks (typically 4096 bytes at a time), and the read must begin at a multiple of that same size. So if your partition and its data do not lie on a 4096-byte boundary, you may have to do two reads instead of one for small, common operations, such as reading the contents of a directory.

影响固态硬盘性能的最重要因素之一是分区对齐。

从固态硬盘读取数据时,是以块为单位的,通常一次读取 4096 字节,而且读取必须从相同大小的倍数开始。

因此,如果你的分区及其数据不在 4096 字节的边界上,你可能需要进行两次读取,而不是一次读取来完成小型的普通操作,例如读取目录的内容。

Many partitioning utilities (parted and gparted, for example) include functionality to put newly created partitions at the proper offsets from the beginning of the disks, so you may never need to worry about improper partition alignment. However, if you’re curious about where your partitions begin and just want to make sure that they begin on a boundary, you can easily find this information by looking in /sys/block. Here’s an example for a partition /dev/sdf2:

许多分区工具(例如parted和gparted)包括将新创建的分区放置在磁盘开头正确偏移的功能,所以你可能永远不需要担心不正确的分区对齐。

然而,如果你想知道你的分区从哪里开始,并且只是想确保它们从边界开始,你可以通过查看/sys/block中的信息轻松找到这些信息。

下面是一个/dev/sdf2分区的示例:

$ cat /sys/block/sdf/sdf2/start
1953126

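This partition starts 1,953,126 sectors from the beginning of the disk. The start file counts in 512-byte units, so the byte offset is 1,953,126 × 512 = 1,000,000,512. Because this offset is not divisible by 4,096 (equivalently, the sector number is not divisible by 8), the partition would not be attaining optimal performance if it were on SSD.

这个分区从磁盘开头的第1,953,126个扇区处开始。start文件以512字节为单位计数,因此字节偏移量为1,953,126 × 512 = 1,000,000,512。

由于这个偏移量不能被4,096整除(等价地,扇区号不能被8整除),如果它在SSD上,该分区将无法达到最佳性能。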
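The divisibility check can be scripted. The sketch below assumes that the value in the sysfs start file is expressed in 512-byte sectors, which is how the kernel reports it, and converts it to bytes before testing it against the alignment size. check_alignment is a hypothetical name, and the 4096-byte default is just the typical chunk size mentioned above.

```shell
# Hypothetical helper: report whether a partition start is aligned.
# $1: partition start in 512-byte sectors (as read from .../start)
# $2: alignment in bytes (optional, defaults to 4096)
check_alignment() {
    start_sectors=$1
    align_bytes=${2:-4096}
    # Convert sectors to bytes, then test divisibility.
    if [ $(( start_sectors * 512 % align_bytes )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}

check_alignment 1953126
```

On a real system you might call it as check_alignment "$(cat /sys/block/sdf/sdf2/start)".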

4.2 Filesystems(文件系统)

The last link between the kernel and user space for disks is typically the filesystem; this is what you’re accustomed to interacting with when you run commands such as ls and cd. As previously mentioned, the filesystem is a form of database; it supplies the structure to transform a simple block device into the sophisticated hierarchy of files and subdirectories that users can understand.

对于磁盘来说,内核与用户空间之间的最后一个连接通常是文件系统;当你运行诸如 ls 和 cd 等命令时,你习惯于与文件系统进行交互。

正如先前提到的,文件系统是一种数据库形式;它提供了将简单的块设备转化为用户可以理解的复杂文件和子目录层次结构的结构。

At one time, filesystems resided on disks and other physical media used exclusively for data storage. However, the tree-like directory structure and I/O interface of filesystems are quite versatile, so filesystems now perform a variety of tasks, such as the system interfaces that you see in /sys and /proc. Filesystems are also traditionally implemented in the kernel, but the innovation of 9P from Plan 9 (http://plan9.bell-labs.com/sys/doc/9.html) has inspired the development of user-space filesystems. The File System in User Space (FUSE) feature allows user-space filesystems in Linux.

曾经,文件系统存储在磁盘和其他专门用于数据存储的物理介质上。

然而,文件系统的树状目录结构和 I/O 接口非常灵活,因此现在文件系统执行各种任务,例如您在 /sys 和 /proc 中看到的系统接口。

文件系统通常也是在内核中实现的,但 Plan 9 的 9P(http://plan9.bell-labs.com/sys/doc/9.html)的创新推动了用户空间文件系统的发展。

用户空间文件系统(File System in User Space,FUSE)功能允许在 Linux 中使用用户空间文件系统。

The Virtual File System (VFS) abstraction layer completes the filesystem implementation. Much as the SCSI subsystem standardizes communication between different device types and kernel control commands, VFS ensures that all filesystem implementations support a standard interface so that user-space applications access files and directories in the same manner. VFS support has enabled Linux to support an extraordinarily large number of filesystems.

虚拟文件系统(Virtual File System,VFS)抽象层完成了文件系统的实现。

就像 SCSI 子系统标准化不同设备类型和内核控制命令之间的通信一样,VFS 确保所有文件系统实现都支持标准接口,以便用户空间应用程序以相同的方式访问文件和目录。

VFS 的支持使得 Linux 能够支持非常多的文件系统。

4.2.1 Filesystem Types(文件系统类型)

Linux filesystem support includes native designs optimized for Linux, foreign types such as the Windows FAT family, universal filesystems like ISO 9660, and many others. The following list includes the most common types of filesystems for data storage. The type names as recognized by Linux are in parentheses next to the filesystem names.

Linux文件系统支持包括针对Linux进行优化的本机设计,如Windows FAT系列等外部类型,像ISO 9660这样的通用文件系统,以及其他许多类型。

下面的列表包括用于数据存储的最常见的文件系统类型。Linux识别的类型名称在文件系统名称旁边用括号表示。

  • The Fourth Extended filesystem (ext4) is the current iteration of a line of filesystems native to Linux. The Second Extended filesystem (ext2) was a longtime default for Linux systems inspired by traditional Unix filesystems such as the Unix File System (UFS) and the Fast File System (FFS). The Third Extended filesystem (ext3) added a journal feature (a small cache outside the normal filesystem data structure) to enhance data integrity and hasten booting. The ext4 filesystem is an incremental improvement that supports larger files than ext2 or ext3 and a greater number of subdirectories.
  • 第四代扩展文件系统(ext4)是Linux本机文件系统系列的当前版本。第二代扩展文件系统(ext2)是Linux系统的长期默认文件系统,受传统Unix文件系统(如Unix文件系统(UFS)和快速文件系统(FFS))的启发。第三代扩展文件系统(ext3)在常规文件系统数据结构之外增加了日志功能(一个小缓存),以增强数据完整性和加快启动速度。ext4文件系统是一种增量改进,支持比ext2或ext3更大的文件和更多的子目录。

There is a certain amount of backward compatibility in the extended filesystem series. For example, you can mount ext2 and ext3 filesystems as each other, and you can mount ext2 and ext3 filesystems as ext4, but you cannot mount ext4 as ext2 or ext3.

扩展文件系统系列具有一定的向后兼容性。

例如,您可以将ext2和ext3文件系统挂载为彼此,也可以将ext2和ext3文件系统挂载为ext4,但不能将ext4挂载为ext2或ext3。

  • ISO 9660 (iso9660) is a CD-ROM standard. Most CD-ROMs use some variety of the ISO 9660 standard.
  • ISO 9660(iso9660)是一种CD-ROM标准。大多数CD-ROM使用ISO 9660标准的某种变体。
  • FAT filesystems (msdos, vfat, umsdos) pertain to Microsoft systems. The simple msdos type supports the very primitive monocase variety in MS-DOS systems. For most modern Windows filesystems, you should use the vfat filesystem in order to get full access from Linux. The rarely used umsdos filesystem is peculiar to Linux. It supports Unix features such as symbolic links on top of an MS-DOS filesystem.
  • FAT文件系统(msdos,vfat,umsdos)适用于Microsoft系统。简单的msdos类型支持MS-DOS系统中的非常原始的单大小写变体。对于大多数现代Windows文件系统,您应该使用vfat文件系统以便从Linux获得完全访问权限。很少使用的umsdos文件系统是Linux特有的,它在MS-DOS文件系统之上支持符号链接等Unix功能。
  • HFS+ (hfsplus) is an Apple standard used on most Macintosh systems.
  • HFS+(hfsplus)是苹果标准,在大多数Macintosh系统上使用。

Although the Extended filesystem series has been perfectly acceptable to most casual users, many advances have been made in filesystem technology that even ext4 cannot utilize due to the backward compatibility requirement. The advances are primarily in scalability enhancements pertaining to very large numbers of files, large files, and similar scenarios. New Linux filesystems, such as Btrfs, are under development and may be poised to replace the Extended series.

虽然扩展文件系统系列对于大多数普通用户来说已经完全可接受,但是由于向后兼容性要求,即使是ext4也无法利用文件系统技术的许多进展。

这些进展主要体现在与大量文件、大文件和类似情况相关的可扩展性增强方面。

新的Linux文件系统,如Btrfs,正在开发中,并有可能取代扩展系列。

4.2.2 Creating a Filesystem(创建文件系统)

Once you’re done with the partitioning process described in 4.1 Partitioning Disk Devices, you’re ready to create filesystems. As with partitioning, you’ll do this in user space because a user-space process can directly access and manipulate a block device. The mkfs utility can create many kinds of filesystems. For example, you can create an ext4 filesystem on the partition /dev/sdf2 with this command:

在完成4.1节中描述的磁盘设备分区过程后,您可以开始创建文件系统。

与分区一样,您将在用户空间中进行此操作,因为用户空间进程可以直接访问和操作块设备。mkfs实用程序可以创建多种类型的文件系统。

例如,您可以使用以下命令在分区/dev/sdf2上创建一个ext4文件系统:

# mkfs -t ext4 /dev/sdf2

The mkfs program automatically determines the number of blocks in a device and sets some reasonable defaults. Unless you really know what you’re doing and feel like reading the documentation in detail, don’t change these.

mkfs程序会自动确定设备上的块数并设置一些合理的默认值。除非您真的知道自己在做什么并愿意详细阅读文档,否则不要更改这些默认值。

When you create a filesystem, mkfs prints diagnostic output as it works, including output pertaining to the superblock. The superblock is a key component at the top level of the filesystem database, and it’s so important that mkfs creates a number of backups in case the original is destroyed. Consider recording a few of the superblock backup numbers when mkfs runs, in case you need to recover the superblock in the event of a disk failure (see 4.2.11 Checking and Repairing Filesystems).

创建文件系统时,mkfs会在工作过程中打印诊断输出,包括与超级块相关的输出。

超级块是文件系统数据库的顶层关键组件,它非常重要,以至于mkfs会创建多个备份以防原始超级块被破坏。

考虑在mkfs运行时记录一些超级块备份号码,以防在磁盘故障时需要恢复超级块(参见4.2.11节 检查和修复文件系统)。

WARNING

Filesystem creation is a task that you should only need to perform after adding a new disk or repartitioning an old one. You should create a filesystem just once for each new partition that has no preexisting data (or that has data that you want to remove). Creating a new filesystem on top of an existing filesystem will effectively destroy the old data.

警告
文件系统的创建是一个任务,你只需要在添加新的磁盘或重新分区旧磁盘之后才需要执行。
你应该为每个新的分区只创建一次文件系统,这个分区没有预先存在的数据(或者有你想要删除的数据)。
在现有的文件系统上创建一个新的文件系统将会有效地销毁旧数据。

It turns out that mkfs is only a frontend for a series of filesystem creation programs, mkfs.fs, where fs is a filesystem type. So when you run mkfs -t ext4, mkfs in turn runs mkfs.ext4.

原来,mkfs只是一系列文件系统创建程序mkfs.fs的前端,其中fs是文件系统类型。

所以当你运行mkfs -t ext4时,mkfs实际上会运行mkfs.ext4。

And there’s even more indirection. Inspect the mkfs.* files behind the commands and you’ll see the following:

而且还存在更多的间接性。

检查命令后面的mkfs.*文件,你会看到以下内容。

$ ls -l /sbin/mkfs.*
-rwxr-xr-x 1 root root 17896 Mar 29 21:49 /sbin/mkfs.bfs
-rwxr-xr-x 1 root root 30280 Mar 29 21:49 /sbin/mkfs.cramfs
lrwxrwxrwx 1 root root 6 Mar 30 13:25 /sbin/mkfs.ext2 -> mke2fs
lrwxrwxrwx 1 root root 6 Mar 30 13:25 /sbin/mkfs.ext3 -> mke2fs
lrwxrwxrwx 1 root root 6 Mar 30 13:25 /sbin/mkfs.ext4 -> mke2fs
lrwxrwxrwx 1 root root 6 Mar 30 13:25 /sbin/mkfs.ext4dev -> mke2fs
-rwxr-xr-x 1 root root 26200 Mar 29 21:49 /sbin/mkfs.minix
lrwxrwxrwx 1 root root 7 Dec 19 2011 /sbin/mkfs.msdos -> mkdosfs
lrwxrwxrwx 1 root root 6 Mar 5 2012 /sbin/mkfs.ntfs -> mkntfs
lrwxrwxrwx 1 root root 7 Dec 19 2011 /sbin/mkfs.vfat -> mkdosfs

As you can see, mkfs.ext4 is just a symbolic link to mke2fs. This is important to remember if you run across a system without a specific mkfs command or when you’re looking up the documentation for a particular filesystem. Each filesystem’s creation utility has its own manual page, like mke2fs(8). This shouldn’t be a problem on most systems, because accessing the mkfs.ext4(8) manual page should redirect you to the mke2fs(8) manual page, but keep it in mind.

正如你所看到的,mkfs.ext4只是mke2fs的一个符号链接。

如果你在一个没有特定mkfs命令的系统上运行或者在查找特定文件系统的文档时,这一点很重要。

每个文件系统的创建工具都有自己的手册页,比如mke2fs(8)。

在大多数系统上,这不应该是一个问题,因为访问mkfs.ext4(8)手册页应该会重定向到mke2fs(8)手册页,但请记住这一点。
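The two levels of indirection described above (mkfs to mkfs.fs, then a symbolic link to the real program) can be traced with a short sketch. The function names are hypothetical. The lookup relies on the helper being in your PATH, which may not be the case for a non-root user when the helpers live in /sbin.

```shell
# Hypothetical helper: the program name that mkfs -t TYPE hands off to.
# $1: filesystem type (for example, ext4)
mkfs_helper_name() {
    printf 'mkfs.%s\n' "$1"
}

# Hypothetical helper: follow symbolic links to the program that
# actually runs. Returns non-zero if the helper is not found in PATH.
resolve_mkfs_helper() {
    helper_path=$(command -v "$(mkfs_helper_name "$1")") || return 1
    readlink -f "$helper_path"
}

mkfs_helper_name ext4
```

On a system laid out like the listing above, resolve_mkfs_helper ext4 would be expected to print the path of mke2fs, which in turn tells you which manual page to read.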

4.2.3 Mounting a Filesystem(挂载文件系统)

On Unix, the process of attaching a filesystem is called mounting. When the system boots, the kernel reads some configuration data and mounts root (/) based on the configuration data.

在Unix系统中,连接文件系统的过程称为挂载。

当系统启动时,内核会读取一些配置数据,并根据配置数据挂载根目录(/)。

In order to mount a filesystem, you must know the following:

要挂载一个文件系统,你需要知道以下信息:

  • The filesystem’s device (such as a disk partition; where the actual filesystem data resides).

o 文件系统的设备(如磁盘分区,实际存储文件系统数据的位置)。

  • The filesystem type.

o 文件系统类型。

  • The mount point—that is, the place in the current system’s directory hierarchy where the filesystem will be attached. The mount point is always a normal directory. For instance, you could use /cdrom as a mount point for CD-ROM devices. The mount point need not be directly below /; it may be anywhere on the system.

o 挂载点 -- 即将文件系统连接到当前系统目录层次结构中的位置。

挂载点始终是一个普通目录。例如,你可以使用/cdrom作为CD-ROM设备的挂载点。

挂载点不一定直接位于/下面,它可以位于系统的任何位置。

When mounting a filesystem, the common terminology is “mount a device on a mount point.” To learn the current filesystem status of your system, run mount. The output should look like this:

在挂载文件系统时,常用的术语是“将设备挂载到挂载点上”。

要了解系统当前的文件系统状态,请运行mount命令。输出应该如下所示:

$ mount
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
--snip--

Each line corresponds to one currently mounted filesystem, with items in this order:

每行对应一个当前已挂载的文件系统,按照以下顺序排列:

  • The device, such as /dev/sda3. Notice that some of these aren’t real devices (proc, for example) but are stand-ins for real device names because these special-purpose filesystems do not need devices.
  • 设备,例如/dev/sda3。请注意,其中一些并非真实设备(例如proc),而是真实设备名称的替代品,因为这些特殊用途的文件系统不需要设备。
  • The word on.
  • 单词on。
  • The mount point.
  • 挂载点。
  • The word type.
  • 单词type。
  • The filesystem type, usually in the form of a short identifier.
  • 文件系统类型,通常以简短标识符的形式表示。
  • Mount options (in parentheses). (See 4.2.6 Filesystem Mount Options for more details.)
  • 挂载选项(括号中)。(有关更多详细信息,请参阅4.2.6文件系统挂载选项。)
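The field order listed above can be illustrated by splitting one line of mount output into its parts. This is a minimal sketch that assumes neither the device nor the mount point contains spaces; parse_mount_line is a hypothetical name.

```shell
# Hypothetical helper: label the fields of one line of mount output.
# $1: a line such as "/dev/sda1 on / type ext4 (rw,errors=remount-ro)"
parse_mount_line() {
    printf '%s\n' "$1" | awk '{
        opts = $6
        gsub(/[()]/, "", opts)   # drop the surrounding parentheses
        # Fields 2 and 4 are the literal words "on" and "type".
        printf "device=%s mountpoint=%s type=%s options=%s\n", $1, $3, $5, opts
    }'
}

parse_mount_line '/dev/sda1 on / type ext4 (rw,errors=remount-ro)'
```

For scripting against a live system, the same information is also available in /proc/mounts in a slightly different layout.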

To mount a filesystem, use the mount command as follows with the filesystem type, device, and desired mount point:

要挂载文件系统,请使用以下格式的mount命令,指定文件系统类型、设备和挂载点:

# mount -t type device mountpoint

For example, to mount the Fourth Extended filesystem /dev/sdf2 on /home/extra, use this command:

例如,要将第四扩展文件系统/dev/sdf2挂载到/home/extra上,使用以下命令:

# mount -t ext4 /dev/sdf2 /home/extra

You normally don’t need to supply the -t type option because mount can usually figure it out for you. However, sometimes it’s necessary to distinguish between two similar types, such as the various FAT-style filesystems.

通常情况下,您不需要提供-t类型选项,因为mount通常可以自动识别。然而,有时需要区分两种相似的类型,例如各种FAT风格的文件系统。

See 4.2.6 Filesystem Mount Options for a few more long options to mount. To unmount (detach) a filesystem, use the umount command:

有关更多挂载选项的详细信息,请参阅4.2.6文件系统挂载选项。要卸载(解除挂载)文件系统,请使用umount命令:

# umount mountpoint

You can also unmount a filesystem with its device instead of its mount point.

4.2.4 Filesystem UUID(文件系统UUID)

The method of mounting filesystems discussed in the preceding section depends on device names. However, device names can change because they depend on the order in which the kernel finds the devices. To solve this problem, you can identify and mount filesystems by their Universally Unique Identifier (UUID), a software standard. The UUID is a type of serial number, and each one should be different. Filesystem creation programs like mke2fs generate a UUID when initializing the filesystem data structure.

在前面的章节中讨论的挂载文件系统的方法依赖于设备名称。

然而,设备名称可能会发生变化,因为它们取决于内核找到设备的顺序。

为了解决这个问题,您可以通过它们的通用唯一标识符(UUID)来识别和挂载文件系统,这是一种软件标准。

UUID是一种序列号类型,每个UUID都应该是不同的。

文件系统创建程序(如mke2fs)在初始化文件系统数据结构时会生成一个UUID标识符。

To view a list of devices and the corresponding filesystems and UUIDs on your system, use the blkid (block ID) program:

要查看系统上设备和相应的文件系统和UUID的列表,请使用blkid(块ID)程序:

# blkid
/dev/sdf2: UUID="a9011c2b-1c03-4288-b3fe-8ba961ab0898" TYPE="ext4"
/dev/sda1: UUID="70ccd6e7-6ae6-44f6-812c-51aab8036d29" TYPE="ext4"
/dev/sda5: UUID="592dcfd1-58da-4769-9ea8-5f412a896980" TYPE="swap"
/dev/sde1: SEC_TYPE="msdos" UUID="3762-6138" TYPE="vfat"

In this example, blkid found four partitions with data: two with ext4 filesystems, one with a swap space signature (see 4.3 swap space), and one with a FAT-based filesystem. The Linux native partitions all have standard UUIDs, but the FAT partition doesn’t have one. You can reference the FAT partition with its FAT volume serial number (in this case, 3762-6138).

在这个例子中,blkid找到了四个带有数据的分区:两个带有ext4文件系统,一个带有交换空间签名(参见4.3交换空间),一个带有基于FAT的文件系统。

Linux本地分区都有标准的UUID,但是FAT分区没有。您可以使用其FAT卷序列号(在本例中为3762-6138)引用FAT分区。

To mount a filesystem by its UUID, use the UUID= syntax. For example, to mount the first filesystem from the preceding list on /home/extra, enter:

要通过其UUID挂载文件系统,请使用UUID=语法。例如,要将前面列表中的第一个文件系统挂载到/home/extra上,请输入:

mount UUID=a9011c2b-1c03-4288-b3fe-8ba961ab0898 /home/extra
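If you ever need to pick UUIDs out of blkid output in a script, the following is a minimal sketch. It assumes the output has the same shape as the sample above; the names parse_blkid and mount_command are hypothetical helpers made up for this illustration, not part of any tool.

```python
import shlex

# Sample text copied from the blkid output shown above.
SAMPLE_BLKID = """\
/dev/sdf2: UUID="a9011c2b-1c03-4288-b3fe-8ba961ab0898" TYPE="ext4"
/dev/sda1: UUID="70ccd6e7-6ae6-44f6-812c-51aab8036d29" TYPE="ext4"
/dev/sda5: UUID="592dcfd1-58da-4769-9ea8-5f412a896980" TYPE="swap"
/dev/sde1: SEC_TYPE="msdos" UUID="3762-6138" TYPE="vfat"
"""


def parse_blkid(text):
    """Return {device: {KEY: value}} for each non-empty blkid-style line."""
    devices = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        device, _, rest = line.partition(": ")
        tags = {}
        # shlex.split strips the double quotes around each value.
        for token in shlex.split(rest):
            key, _, value = token.partition("=")
            tags[key] = value
        devices[device] = tags
    return devices


def mount_command(uuid, mountpoint):
    """Build (but do not run) the argument list for a mount by UUID."""
    return ["mount", f"UUID={uuid}", mountpoint]


if __name__ == "__main__":
    for device, tags in parse_blkid(SAMPLE_BLKID).items():
        print(device, tags.get("TYPE"), tags.get("UUID"))
```

The sketch only builds the command; actually running mount requires root privileges.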

You will typically not manually mount filesystems by UUID as above, because you’ll probably know the device, and it’s much easier to mount a device by its name than by its crazy UUID number. Still, it’s important to understand UUIDs. For one thing, they’re the preferred way to automatically mount filesystems in /etc/fstab at boot time (see 4.2.8 The /etc/fstab Filesystem Table). In addition, many distributions use the UUID as a mount point when you insert removable media. In the preceding example, the FAT filesystem is on a flash media card. An Ubuntu system with someone logged in will mount this partition at /media/3762-6138 upon insertion. The udevd daemon described in Chapter 3 handles the initial event for the device insertion.

通常情况下,你不会像上面那样手动按UUID挂载文件系统,因为你很可能已经知道设备的名称,按名称挂载设备比按疯狂的UUID号更容易。

然而,理解UUID仍然很重要。首先,它们是在引导时自动挂载文件系统到/etc/fstab文件系统表中的首选方式(参见4.2.8节《/etc/fstab文件系统表》)。

此外,许多发行版在插入可移动介质时使用UUID作为挂载点。

在上面的例子中,FAT文件系统位于闪存媒体卡上。

登录到Ubuntu系统的用户会在插入时将此分区挂载到/media/3762-6138。

第3章中描述的udev守护进程会处理设备插入的初始事件。

You can change the UUID of a filesystem if necessary (for example, if you copied the complete filesystem from somewhere else and now need to distinguish it from the original). See the tune2fs(8) manual page for how to do this on an ext2/ext3/ext4 filesystem.

如果需要的话,你可以更改文件系统的UUID(例如,如果你从其他地方复制了完整的文件系统,现在需要将其与原始文件系统区分开)。

请参阅tune2fs(8)手册页,了解如何在ext2/ext3/ext4文件系统上进行此操作。

4.2.5 Disk Buffering, Caching, and Filesystems(磁盘缓冲、缓存和文件系统)

Linux, like other versions of Unix, buffers writes to the disk. This means that the kernel usually doesn’t immediately write changes to filesystems when processes request changes. Instead it stores the changes in RAM until the kernel can conveniently make the actual change to the disk. This buffering system is transparent to the user and improves performance.

Linux和其他Unix版本一样,会对写入磁盘进行缓冲。

这意味着内核通常不会立即将进程请求的更改写入文件系统。

相反,它会将更改存储在RAM中,直到内核方便地将实际更改写入磁盘。

这种缓冲系统对用户透明,并提高了性能。

When you unmount a filesystem with umount, the kernel automatically synchronizes with the disk. At any other time, you can force the kernel to write the changes in its buffer to the disk by running the sync command. If for some reason you can’t unmount a filesystem before you turn off the system, be sure to run sync first.

当您使用umount卸载文件系统时,内核会自动与磁盘同步。

在其他任何时间,您可以通过运行sync命令强制内核将其缓冲区中的更改写入磁盘。

如果由于某种原因您无法在关闭系统之前卸载文件系统,请务必先运行sync。
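Programs can request the same kind of synchronization for themselves. The following is a rough sketch, assuming a Unix-like system with Python 3; write_durably and sync_all are hypothetical names chosen for this example. It uses os.fsync() to flush one file and os.sync(), which does what the sync command does.

```python
import os


def write_durably(path, data):
    """Write bytes to path and wait until the kernel has flushed them."""
    with open(path, "wb") as f:
        f.write(data)         # data may still sit in Python's own buffer
        f.flush()             # hand the data to the kernel's buffer
        os.fsync(f.fileno())  # ask the kernel to write this file to disk now


def sync_all():
    """Flush all buffered filesystem changes, like the sync command."""
    os.sync()
```

Flushing a single file with os.fsync() is usually cheaper than a system-wide sync, which is why the two are kept separate here.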

In addition, the kernel has a series of mechanisms that use RAM to automatically cache blocks read from a disk. Therefore, if one or more processes repeatedly access a file, the kernel doesn’t have to go to the disk again and again; it can simply read from the cache, saving time and resources.

此外,内核还具有一系列机制,使用RAM自动缓存从磁盘读取的块。因此,如果一个或多个进程反复访问一个文件,内核就不必一次又一次地访问磁盘,而是可以直接从缓存中读取,从而节省时间和资源。

4.2.6 Filesystem Mount Options(文件系统挂载选项)

There are many ways to change the mount command behavior, as is often necessary with removable media or when performing system maintenance. In fact, the total number of mount options is staggering. The extensive mount(8) manual page is a good reference, but it’s hard to know where to start and what you can safely ignore. You’ll see the most useful options in this section.

有许多方法可以改变挂载命令的行为,这在使用可移动介质或进行系统维护时经常是必要的。

事实上,挂载选项的总数是惊人的。

详尽的mount(8)手册页是一个很好的参考,但很难知道从哪里开始和哪些选项可以安全地忽略。

在本节中,您将看到最有用的选项。

Options fall into two rough categories: general and filesystem-specific ones. General options include -t for specifying the filesystem type (as mentioned earlier). In contrast, a filesystem-specific option pertains only to certain filesystem types.

选项分为两大类:通用选项和特定于文件系统的选项。

通用选项包括指定文件系统类型的 -t 选项(如前面所述)。相反,特定于文件系统的选项仅适用于特定的文件系统类型。

To activate a filesystem option, use the -o switch followed by the option. For example, -o norock turns off Rock Ridge extensions on an ISO 9660 filesystem, but it has no meaning for any other kind of filesystem.

要激活文件系统选项,请使用 -o 开关,后跟选项。

例如,-o norock 在ISO 9660文件系统上关闭Rock Ridge扩展,但对于其他任何类型的文件系统都没有意义。

Short Options(短选项)

The most important general options are these:

最重要的通用选项如下:

  • -r The -r option mounts the filesystem in read-only mode. This has a number of uses, from write protection to bootstrapping. You don’t need to specify this option when accessing a read-only device such as a CD-ROM; the system will do it for you (and will also tell you about the read-only status).

o -r 选项以只读模式挂载文件系统。这有许多用途,从写保护到引导。

当访问只读设备(如CD-ROM)时,您无需指定此选项;系统会为您执行(并且还会告诉您只读状态)。

  • -n The -n option ensures that mount does not try to update the system runtime mount database, /etc/mtab. The mount operation fails when it cannot write to this file, which is important at boot time because the root partition (and, therefore, the system mount database) is read-only at first. You’ll also find this option handy when trying to fix a system problem in single-user mode, because the system mount database may not be available at the time.

o -n 选项确保挂载不会尝试更新系统运行时挂载数据库 /etc/mtab。

当无法写入此文件时,挂载操作将失败,这在启动时非常重要,因为根分区(因此也是系统挂载数据库)起初是只读的。

当您尝试在单用户模式下修复系统问题时,您还会发现此选项很方便,因为系统挂载数据库可能在那时不可用。

  • -t The -t type option specifies the filesystem type.

o -t type 选项指定文件系统类型。

Long Options(长选项)

Short options like -r are too limited for the ever-increasing number of mount options; there are too few letters in the alphabet to accommodate all possible options. Short options are also troublesome because it is difficult to determine an option’s meaning based on a single letter. Many general options and all filesystem-specific options use a longer, more flexible option format.

像 -r 这样的短选项对于越来越多的挂载选项来说太有限了;字母表中的字母太少,无法容纳所有可能的选项。

短选项也很麻烦,因为很难根据一个字母确定选项的含义。

许多通用选项和所有特定于文件系统的选项都使用更长、更灵活的选项格式。

To use long options with mount on the command line, start with -o and supply some keywords. Here’s a complete example, with the long options following -o:

要在命令行上使用mount的长选项,请以 -o 开始,然后提供一些关键字。以下是一个完整的例子,长选项在 -o 之后:

mount -t vfat /dev/hda1 /dos -o ro,conv=auto

The two long options here are ro and conv=auto. The ro option specifies read-only mode and is the same as the -r short option. The conv=auto option tells the kernel to automatically convert certain text files from the DOS newline format to the Unix style (you’ll see more shortly).

这里有两个长选项,分别是ro和conv=auto。ro选项指定只读模式,与短选项-r相同。conv=auto选项告诉内核自动将某些文本文件从DOS的换行格式转换为Unix风格(稍后会详细介绍)。

The most useful long options are these:

最有用的长选项如下:

o exec, noexec Enables or disables execution of programs on the filesystem.
o suid, nosuid Enables or disables setuid programs.
o ro Mounts the filesystem in read-only mode (as does the -r short option).
o rw Mounts the filesystem in read-write mode.
o conv=rule (FAT-based filesystems) Converts the newline characters in files based on rule, which can be binary, text, or auto. The default is binary, which disables any character translation. To treat all files as text, use text. The auto setting converts files based on their extension. For example, a .jpg file gets no special treatment, but a .txt file does. Be careful with this option because it can damage files. Consider using it in read-only mode.

  • exec, noexec:启用或禁用文件系统上的程序执行。
  • suid, nosuid:启用或禁用setuid程序。
  • ro:以只读模式挂载文件系统(与短选项-r相同)。
  • rw:以读写模式挂载文件系统。
  • conv=rule(基于FAT的文件系统):根据规则转换文件中的换行符,规则可以是二进制、文本或自动。默认值是二进制,即禁用任何字符转换。要将所有文件视为文本,请使用文本。自动设置会根据文件的扩展名进行转换。例如,.jpg文件不会受到特殊处理,但.txt文件会。请谨慎使用此选项,因为它可能会损坏文件。建议在只读模式下使用。
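The comma-separated format that follows -o is easy to take apart. Here is a small sketch of one way to do it; parse_mount_options is a hypothetical helper for illustration and is not how mount itself is implemented.

```python
def parse_mount_options(option_string):
    """Split a string such as 'ro,conv=auto' into a dictionary.

    Bare keywords map to True; key=value pairs map to the value string.
    """
    options = {}
    for item in option_string.split(","):
        if not item:
            continue  # tolerate stray commas
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


if __name__ == "__main__":
    print(parse_mount_options("ro,conv=auto"))
```

Keeping bare keywords and key=value pairs in one dictionary makes it simple to check for a flag such as ro and to read a setting such as conv in the same way.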

4.2.7 Remounting a Filesystem(重新挂载文件系统)

At times you may need to reattach a currently mounted filesystem at the same mount point in order to change its mount options. The most common such situation is when you need to make a read-only filesystem writable during crash recovery.

有时,当您需要更改挂载选项时,可能需要在同一挂载点重新挂载当前已挂载的文件系统。最常见的情况是在崩溃恢复期间需要使只读文件系统可写。

The following command remounts the root in read-write mode (you need the -n option because the mount command can’t write to the system mount database when the root is read-only):

以下命令以读写模式重新挂载根文件系统(需要使用 -n 选项,因为当根文件系统为只读时,挂载命令无法写入系统挂载数据库):

 mount -n -o remount /

This command assumes that the correct device listing for / is in /etc/fstab (as discussed in the next section). If it is not, you must specify the device.

4.2.8 The /etc/fstab Filesystem Table(/etc/fstab 文件系统表)

To mount filesystems at boot time and take the drudgery out of the mount command, Linux systems keep a permanent list of filesystems and options in /etc/fstab. This is a plaintext file in a very simple format, as Listing 4-1 shows.

为了在启动时挂载文件系统并省去挂载命令的繁琐工作,Linux 系统会在 /etc/fstab 中保存一份永久的文件系统和选项列表。

这是一个格式非常简单的纯文本文件,如清单 4-1 所示。

Listing 4-1. List of filesystems and options in /etc/fstab

例 4-1. /etc/fstab 中的文件系统和选项列表

proc /proc proc nodev,noexec,nosuid 0 0
UUID=70ccd6e7-6ae6-44f6-812c-51aab8036d29 / ext4 errors=remount-ro 0 1
UUID=592dcfd1-58da-4769-9ea8-5f412a896980 none swap sw 0 0
/dev/sr0 /cdrom iso9660 ro,user,nosuid,noauto 0 0

Each line corresponds to one filesystem, each of which is broken into six fields. These fields are as follows, in order from left to right:

每一行对应一个文件系统,每个文件系统分为六个字段。

这些字段按从左到右的顺序排列如下:

o The device or UUID. Most current Linux systems no longer use the device in /etc/fstab, preferring the UUID. (Notice that the /proc entry has a stand-in device named proc.)
o The mount point. Indicates where to attach the filesystem.
o The filesystem type. You may not recognize swap in this list; this is a swap partition (see 4.3 swap space).
o Options. Use long options separated by commas.
o Backup information for use by the dump command. You should always use a 0 in this field.
o The filesystem integrity test order. To ensure that fsck always runs on the root first, always set this to 1 for the root filesystem and 2 for any other filesystems on a hard disk. Use 0 to disable the bootup check for everything else, including CD-ROM drives, swap, and the /proc filesystem (see the fsck command in 4.2.11 Checking and Repairing Filesystems).

o 设备或 UUID。当前大多数 Linux 系统不再使用 /etc/fstab 中的设备,而更喜欢使用 UUID。(注意,/proc 条目有一个名为 proc 的备用设备)。
o 挂载点。表示文件系统的挂载点。
o 文件系统类型。在此列表中可能找不到 swap;这是一个交换分区(参见 4.3 交换空间)。
o 选项。使用由逗号分隔的长选项。
o 用于 dump 命令的备份信息。在此字段中应始终使用 0。
o 文件系统完整性测试顺序。为了确保fsck始终首先在根文件系统上运行,应将其设置为1,对于硬盘上的任何其他文件系统都设置为2。使用0来禁用启动时的检查,包括CD-ROM驱动器、交换空间和/proc文件系统(请参见4.2.11节“检查和修复文件系统”中的fsck命令)。
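To make the six fields concrete, here is a minimal sketch that splits the lines of Listing 4-1 into named fields. The FstabEntry name and the parse_fstab function are hypothetical, invented for this example, and the sketch assumes every entry carries all six fields, as the listing does.

```python
from collections import namedtuple

# One record per /etc/fstab line, fields in left-to-right order.
FstabEntry = namedtuple(
    "FstabEntry", "device mountpoint fstype options dump fsck_order"
)

# Copied from Listing 4-1.
LISTING_4_1 = """\
proc /proc proc nodev,noexec,nosuid 0 0
UUID=70ccd6e7-6ae6-44f6-812c-51aab8036d29 / ext4 errors=remount-ro 0 1
UUID=592dcfd1-58da-4769-9ea8-5f412a896980 none swap sw 0 0
/dev/sr0 /cdrom iso9660 ro,user,nosuid,noauto 0 0
"""


def parse_fstab(text):
    """Parse fstab-style text into a list of FstabEntry records."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        device, mountpoint, fstype, options, dump, fsck_order = fields
        entries.append(
            FstabEntry(device, mountpoint, fstype, options.split(","),
                       int(dump), int(fsck_order))
        )
    return entries


if __name__ == "__main__":
    for entry in parse_fstab(LISTING_4_1):
        print(entry.mountpoint, entry.fstype, entry.options, entry.fsck_order)
```

To try it on a real system, you could pass in the contents of /etc/fstab, keeping in mind that real files may contain lines this simple parser rejects.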

When using mount, you can take some shortcuts if the filesystem you want to work with is in /etc/fstab. For example, if you were using Listing 4-1 and mounting a CD-ROM, you would simply run mount /cdrom.

在使用mount时,如果要使用的文件系统位于/etc/fstab中,可以采取一些捷径。

例如,如果使用列表4-1并挂载CD-ROM,只需运行mount /cdrom即可。

You can also try to mount all entries at once in /etc/fstab that do not contain the noauto option with this command:

您还可以使用以下命令尝试一次性挂载/etc/fstab中不包含noauto选项的所有条目:

mount -a

Listing 4-1 contains some new options, namely errors, noauto, and user, because they don’t apply outside the /etc/fstab file. In addition, you’ll often see the defaults option here. The meanings of these options are as follows:

列表4-1包含一些新选项,即errors、noauto和user,因为它们不适用于/etc/fstab文件之外。

此外,您经常会在此处看到defaults选项。这些选项的含义如下:

o defaults This uses the mount defaults: read-write mode, enable device files, executables, the setuid bit, and so on. Use this when you don’t want to give the filesystem any special options but you do want to fill all fields in /etc/fstab.
o errors This ext2-specific parameter sets the kernel behavior when the system has trouble mounting a filesystem. The default is normally errors=continue, meaning that the kernel should return an error code and keep running. To have the kernel try the mount again in read-only mode, use errors=remount-ro. The errors=panic setting tells the kernel (and your system) to halt when there is a problem with the mount.
o noauto This option tells a mount -a command to ignore the entry. Use this to prevent a boot-time mount of a removable-media device, such as a CD-ROM or floppy drive.
o user This option allows unprivileged users to run mount on a particular entry, which can be handy for allowing access to CD-ROM drives. Because users can put a setuid-root file on removable media with another system, this option also sets nosuid, noexec, and nodev (to bar special device files).

o defaults:使用默认挂载选项:读写模式,启用设备文件,可执行文件,setuid位等。当您不想为文件系统提供任何特殊选项,但又想填写/etc/fstab中的所有字段时,请使用此选项。

o errors:此ext2特定参数设置系统在挂载文件系统时出现问题时的内核行为。默认情况下通常是errors=continue,意味着内核应返回错误代码并继续运行。要让内核尝试以只读模式重新挂载,请使用errors=remount-ro。当挂载出现问题时,errors=panic设置告诉内核(和您的系统)停止运行。

o noauto:此选项告诉mount -a命令忽略该条目。使用此选项可以防止启动时挂载可移动介质设备,如CD-ROM或软盘驱动器。

o user:此选项允许非特权用户在特定条目上运行mount命令,这对于允许访问CD-ROM驱动器非常方便。

因为用户可以在另一个系统上将一个setuid-root文件放在可移动介质上,所以此选项还设置了nosuid、noexec和nodev(以阻止特殊设备文件的使用)。
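The following sketch illustrates two of the rules just described: user brings nosuid, noexec, and nodev along with it, and noauto removes an entry from consideration by mount -a. It is a simplified model with hypothetical function names, and it deliberately ignores any other rules that the real mount command applies.

```python
def effective_options(option_string):
    """Return the option list with the options implied by 'user' added.

    Simplification: later options that would override the implied ones
    are not taken into account.
    """
    options = option_string.split(",")
    if "user" in options:
        for implied in ("nosuid", "noexec", "nodev"):
            if implied not in options:
                options.append(implied)
    return options


def skipped_by_noauto(option_string):
    """True if the entry carries noauto, so mount -a would ignore it."""
    return "noauto" in option_string.split(",")


if __name__ == "__main__":
    cdrom_options = "ro,user,nosuid,noauto"  # the /cdrom entry in Listing 4-1
    print(effective_options(cdrom_options))
    print(skipped_by_noauto(cdrom_options))
```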

4.2.9 Alternatives to /etc/fstab(/etc/fstab的替代方案)

Although the /etc/fstab file has been the traditional way to represent filesystems and their mount points, two new alternatives have appeared. The first is an /etc/fstab.d directory that contains individual filesystem configuration files (one file for each filesystem). The idea is very similar to many other configuration directories that you’ll see throughout this book.

虽然/etc/fstab文件一直是表示文件系统及其挂载点的传统方式,但出现了两种新的替代方案。

第一种是一个包含单独文件系统配置文件(每个文件系统一个文件)的/etc/fstab.d目录。

这个想法与本书中你将看到的许多其他配置目录非常相似。

A second alternative is to configure systemd units for the filesystems. You’ll learn more about systemd and its units in Chapter 6. However, the systemd unit configuration is often generated from (or based on) the /etc/fstab file, so you may find some overlap on your system.

第二种替代方案是为文件系统配置systemd单位。

在第6章中,你将了解更多关于systemd及其单位的内容。

然而,systemd单位配置通常是从/etc/fstab文件生成的(或基于该文件),因此你可能会在你的系统上找到一些重叠。

4.2.10 Filesystem Capacity(文件系统容量)

To view the size and utilization of your currently mounted filesystems, use the df command. The output should look like this:

要查看当前挂载的文件系统的大小和利用率,请使用df命令。

输出应该如下所示:

$ df
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 1011928 71400 889124 7% /
/dev/sda3 17710044 9485296 7325108 56% /usr

Here’s a brief description of the fields in the df output:

以下是对df输出中各个字段的简要描述:

o Filesystem. The filesystem device
o 1024-blocks. The total capacity of the filesystem in blocks of 1024 bytes
o Used. The number of occupied blocks
o Available. The number of free blocks
o Capacity. The percentage of blocks in use
o Mounted on. The mount point

o 文件系统。文件系统设备
o 1024块。文件系统以1024字节的块为单位的总容量
o 已使用。已占用的块数
o 可用。可用的块数
o 容量。已使用块的百分比
o 挂载点。挂载点

It should be easy to see that the two filesystems here are roughly 1GB and 17.5GB in size. However, the capacity numbers may look a little strange because 71,400 plus 889,124 does not equal 1,011,928, and 9,485,296 does not constitute 56 percent of 17,710,044. In both cases, 5 percent of the total capacity is unaccounted for. In fact, the space is there, but it is hidden in reserved blocks. Therefore, only the superuser can use the full filesystem space if the rest of the partition fills up. This feature keeps system servers from immediately failing when they run out of disk space.

很容易看出这两个文件系统的大小大约为1GB和17.5GB。

然而,容量数字可能看起来有些奇怪,因为71,400加上889,124并不等于1,011,928,而9,485,296也不构成17,710,044的56%。在这两种情况下,总容量中有5%的空间无法解释。

实际上,这些空间是存在的,但是它们被隐藏在保留块中。

因此,只有超级用户在分区剩余空间用尽时才能使用完整的文件系统空间。这个特性可以确保系统服务器在磁盘空间用尽时不会立即失败。
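You can check the arithmetic yourself. The sketch below uses the numbers from the df output above to work out how many blocks are held in reserve and what share of the non-reserved space is in use. The function names are hypothetical, and the rounding is only meant to show where figures like 7% and 56% come from, not to reproduce df’s exact rounding rules.

```python
def reserved_blocks(total, used, available):
    """Blocks that are neither used nor available to ordinary users."""
    return total - used - available


def percent(part, whole):
    """part as a percentage of whole."""
    return 100.0 * part / whole


# (total, used, available) in 1024-byte blocks, from the df output above.
ROWS = {
    "/": (1011928, 71400, 889124),
    "/usr": (17710044, 9485296, 7325108),
}

if __name__ == "__main__":
    for mount_point, (total, used, available) in ROWS.items():
        reserved = reserved_blocks(total, used, available)
        print(
            f"{mount_point}: {reserved} reserved blocks "
            f"({percent(reserved, total):.1f}% of total), "
            f"{percent(used, used + available):.0f}% of non-reserved space used"
        )
```

The key point is that the Capacity column makes sense once you divide the used blocks by used plus available rather than by the total.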

If your disk fills up and you need to know where all of those space-hogging media files are, use the du command. With no arguments, du prints the disk usage of every directory in the directory hierarchy, starting at the current working directory. (That’s kind of a mouthful, so just run cd /; du to get the idea. Press CTRL-C when you get bored.) The du -s command turns on summary mode to print only the grand total. To evaluate a particular directory, change to that directory and run du -s *.

如果您的磁盘空间用尽了,而您需要知道哪些占用空间较大的媒体文件在哪里,可以使用du命令。

不带参数的du命令会打印出目录层次结构中每个目录的磁盘使用情况,从当前工作目录开始。

(这有点啰嗦,所以只需运行cd /; du来理解。

当您感到无聊时,按下CTRL-C即可停止。)du -s命令打开摘要模式,只打印总计。

要评估特定目录,请切换到该目录并运行du -s * 。
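To get a feel for what du -s * adds up, here is a rough sketch that walks each entry of a directory and totals the disk blocks reported by the kernel. It assumes a Linux system, where st_blocks counts 512-byte units; unlike the real du, it counts hard-linked files more than once and silently skips anything it can’t read. The function names are hypothetical.

```python
import os


def disk_usage(path):
    """Approximate bytes of disk space used by path and everything below it."""
    total = os.lstat(path).st_blocks * 512
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            try:
                total += os.lstat(os.path.join(root, name)).st_blocks * 512
            except OSError:
                pass  # entry vanished or is unreadable; skip it
    return total


def summarize(directory="."):
    """Return {entry_name: bytes_used} for each entry, similar to du -s *."""
    return {
        name: disk_usage(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
    }


if __name__ == "__main__":
    for name, size in summarize(".").items():
        print(f"{size // 1024}\t{name}")
```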

NOTE

The POSIX standard defines a block size of 512 bytes. However, this size is harder to read, so by default, the df and du output in most Linux distributions is in 1024-byte blocks. If you insist on displaying the numbers in 512-byte blocks, set the POSIXLY_CORRECT environment variable. To explicitly specify 1024-byte blocks, use the -k option (both utilities support this). The df program also has a -m option to list capacities in 1MB blocks and a -h option to take a best guess at what a person can read.

注意:

POSIX标准定义了512字节的块大小。

然而,这个大小不太容易阅读,所以在大多数Linux发行版中,默认情况下,df和du的输出以1024字节的块为单位。如果您坚持以512字节的块显示数字,请设置POSIXLY_CORRECT环境变量。

要明确指定1024字节的块,请使用-k选项(这两个实用程序都支持此选项)。

df程序还有一个-m选项,可以以1MB块的形式列出容量,以及一个-h选项,可以猜测人类可以阅读的最佳格式。

4.2.11 Checking and Repairing Filesystems(检查和修复文件系统)

The optimizations that Unix filesystems offer are made possible by a sophisticated database mechanism. For filesystems to work seamlessly, the kernel has to trust that there are no errors in a mounted filesystem. If errors exist, data loss and system crashes may result.

Unix 文件系统提供的优化是通过复杂的数据库机制实现的。

为了使文件系统无缝工作,内核必须相信已安装的文件系统中没有错误。

如果存在错误,可能会导致数据丢失和系统崩溃。

Filesystem errors are usually due to a user shutting down the system in a rude way (for example, by pulling out the power cord). In such cases, the filesystem cache in memory may not match the data on the disk, and the system also may be in the process of altering the filesystem when you happen to give the computer a kick. Although a new generation of filesystems supports journals to make filesystem corruption far less common, you should always shut the system down properly. And regardless of the filesystem in use, filesystem checks are still necessary every now and then to maintain sanity.

文件系统错误通常是由于用户以粗鲁的方式关闭系统(例如,拔掉电源线)造成的。

在这种情况下,内存中的文件系统缓存可能与磁盘上的数据不匹配,并且当您碰巧启动计算机时,系统也可能正在更改文件系统。

尽管新一代文件系统支持日志,使文件系统损坏的情况大大减少,但您应该始终正确关闭系统。

无论使用什么文件系统,文件系统检查仍然是必要的,以保持理智。

The tool to check a filesystem is fsck. As with the mkfs program, there is a different version of fsck for each filesystem type that Linux supports. For example, when you run fsck on an Extended filesystem series (ext2/ext3/ext4), fsck recognizes the filesystem type and starts the e2fsck utility. Therefore, you generally don’t need to type e2fsck, unless fsck can’t figure out the filesystem type or you’re looking for the e2fsck manual page.

检查文件系统的工具是 fsck。

与 mkfs 程序一样,Linux 支持的每种文件系统类型都有不同版本的 fsck。

例如,当您在扩展文件系统系列 (ext2/ext3/ext4) 上运行 fsck 时,fsck 会识别文件系统类型并启动 e2fsck 实用程序。

因此,您通常不需要输入 e2fsck,除非 fsck 无法确定文件系统类型或者您正在查找 e2fsck 手册页。

The information presented in this section is specific to the Extended filesystem series and e2fsck.

本节中提供的信息特定于扩展文件系统系列和 e2fsck

To run fsck in interactive manual mode, give the device or the mount point (as listed in /etc/fstab) as the argument. For example:

要在交互式手动模式下运行 fsck,请将设备或安装点(如 /etc/fstab 中列出)作为参数。 例如:

fsck /dev/sdb1

WARNING

You should never use fsck on a mounted filesystem because the kernel may alter the disk data as you run the check, causing runtime mismatches that can crash your system and corrupt files. There is only one exception: If you mount the root partition read-only in single-user mode, you may use fsck on it.

警告

您绝对不应该在已挂载的文件系统上使用fsck,因为在运行检查时,内核可能会修改磁盘数据,导致运行时不匹配,可能会导致系统崩溃和文件损坏。

只有一种例外情况:如果您在单用户模式下以只读方式挂载根分区,可以对其使用fsck。

In manual mode, fsck prints verbose status reports on its passes, which should look something like this when there are no problems:

在手动模式下,fsck会打印其通过的详细状态报告,当没有问题时,应该看起来像这样:

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sdb1: 11/1976 files (0.0% non-contiguous), 265/7891 blocks

If fsck finds a problem in manual mode, it stops and asks you a question relevant to fixing the problem. These questions deal with the internal structure of the filesystem, such as reconnecting loose inodes and clearing blocks (an inode is a building block of the filesystem; you’ll see how inodes work in 4.5 Inside a Traditional Filesystem). When fsck asks you about reconnecting an inode, it has found a file that doesn’t appear to have a name. When reconnecting such a file, fsck places the file in the lost+found directory in the filesystem, with a number as the filename. If this happens, you need to guess the name based on the content of the file; the original name is probably gone.

如果fsck在手动模式下发现问题,它会停止并向您提出与修复问题相关的问题。

这些问题涉及文件系统的内部结构,例如重新连接松散的inode和清除块(inode是文件系统的构建块;您将在4.5节“传统文件系统内部”中了解到inode的工作原理)。

当fsck询问您是否重新连接inode时,它找到了一个似乎没有名称的文件。

在重新连接这样的文件时,fsck将文件放置在文件系统中的lost+found目录中,并以数字作为文件名。

如果发生这种情况,您需要根据文件的内容猜测名称;原始名称可能已丢失。

In general, it’s pointless to sit through the fsck repair process if you’ve just uncleanly shut down the system, because fsck may have a lot of minor errors to fix. Fortunately, e2fsck has a -p option that automatically fixes ordinary problems without asking and aborts when there’s a serious error. In fact, Linux distributions run some variant of fsck -p at boot time. (You may also see fsck -a, which just does the same thing.)

一般来说,如果您刚刚不正常关闭系统,坐在那里等待fsck修复过程是没有意义的,因为fsck可能有很多次要错误需要修复。

幸运的是,e2fsck有一个-p选项,可以自动修复普通问题,而不需要询问,并在出现严重错误时中止。

事实上,Linux发行版在启动时运行某个变体的fsck -p命令。

(您还可能看到fsck -a,它只是做同样的事情。)

If you suspect a major disaster on your system, such as a hardware failure or device misconfiguration, you need to decide on a course of action because fsck can really mess up a filesystem that has larger problems. (One telltale sign that your system has a serious problem is that fsck asks a lot of questions in manual mode.)

如果您怀疑系统出现重大灾难,例如硬件故障或设备配置错误,您需要决定采取行动,因为fsck可能会破坏具有更大问题的文件系统。

(一个明显的迹象是,如果fsck在手动模式下问了很多问题。)

If you think that something really bad has happened, try running fsck -n to check the filesystem without modifying anything. If there’s a problem with the device configuration that you think you can fix (such as an incorrect number of blocks in the partition table or loose cables), fix it before running fsck for real, or you’re likely to lose a lot of data.

如果您认为发生了严重问题,请尝试运行fsck -n以在不修改任何内容的情况下检查文件系统。

如果您认为设备配置有问题,您认为可以修复(例如,分区表中块的数量不正确或电缆松动),请在真正运行fsck之前修复它,否则您可能会丢失大量数据。

If you suspect that only the superblock is corrupt (for example, because someone wrote to the beginning of the disk partition), you might be able to recover the filesystem with one of the superblock backups that mkfs creates. Use fsck -b num to replace the corrupted superblock with an alternate at block num and hope for the best.

如果您怀疑只有超级块损坏(例如,因为有人写入了磁盘分区的开头),您可能可以使用mkfs创建的超级块备份之一来恢复文件系统。

使用fsck -b num将损坏的超级块替换为块num处的备用超级块,并希望一切顺利。

If you don’t know where to find a backup superblock, you may be able to run mkfs -n on the device to view a list of superblock backup numbers without destroying your data. (Again, make sure that you’re using -n, or you’ll really tear up the filesystem.)

如果您不知道在哪里找到备份超级块,您可以在设备上运行mkfs -n以查看超级块备份编号的列表,而不会破坏您的数据。

(再次确保您使用的是-n选项,否则您将真正破坏文件系统。)
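For a sense of what kind of number goes after fsck -b, the sketch below estimates where backup superblocks typically sit. It rests on assumptions that are not stated in this chapter: that the filesystem was created with the sparse_super feature (so backups live in block group 1 and in groups that are powers of 3, 5, and 7), and that you know the blocks-per-group value and the first data block (commonly 32768 and 0 with 4KB blocks, or 8192 and 1 with 1KB blocks). Treat the result as a guess, trust the list that mkfs -n prints over anything computed here, and note that backup_superblocks is a hypothetical name.

```python
def backup_superblocks(total_blocks, blocks_per_group, first_data_block=0):
    """Estimate backup superblock locations under the sparse_super layout."""
    # Number of block groups, rounding up to include a partial last group.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)
    groups = {1}
    for base in (3, 5, 7):
        power = base
        while power < group_count:
            groups.add(power)
            power *= base
    return [
        group * blocks_per_group + first_data_block
        for group in sorted(groups)
        if group < group_count
    ]


if __name__ == "__main__":
    # A hypothetical filesystem of 1,000,000 blocks with 4KB blocks.
    print(backup_superblocks(1_000_000, 32768))
```

If the first candidate doesn’t work, you would try the next one in the list.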

Checking ext3 and ext4 Filesystems( 检查 ext3 和 ext4 文件系统)

You normally do not need to check ext3 and ext4 filesystems manually because the journal ensures data integrity. However, you may wish to mount a broken ext3 or ext4 filesystem in ext2 mode because the kernel will not mount an ext3 or ext4 filesystem with a nonempty journal. (If you don’t shut your system down cleanly, you can expect the journal to contain some data.) To flush the journal in an ext3 or ext4 filesystem to the regular filesystem database, run e2fsck as follows:

通常情况下,您不需要手动检查 ext3 和 ext4 文件系统,因为日志可确保数据完整性。

不过,您可能希望在 ext2 模式下挂载已损坏的 ext3 或 ext4 文件系统,因为内核不会挂载日志不为空的 ext3 或 ext4 文件系统。

(如果不彻底关闭系统,日志中可能会包含一些数据)。

要将 ext3 或 ext4 文件系统中的日志清除到常规文件系统数据库中,请按以下步骤运行 e2fsck:

e2fsck -fy /dev/disk_device

The Worst Case(最坏的情况)

More severe disk problems leave you with few choices:

磁盘问题严重时,你的选择会变得很少:

o You can try to extract the entire filesystem image from the disk with dd and transfer it to a partition on another disk of the same size.
o You can try to patch the filesystem as much as possible, mount it in read-only mode, and salvage what you can.
o You can try debugfs.

o 你可以尝试使用dd从磁盘中提取整个文件系统镜像,并将其传输到另一个相同大小的磁盘的分区中。
o 你可以尝试尽可能修复文件系统,以只读模式挂载它,并尽力挽救你能够挽救的东西。
o 你可以尝试使用debugfs。

In the first two cases, you still need to repair the filesystem before you mount it, unless you feel like picking through the raw data by hand. If you like, you can choose to answer y to all of the fsck questions by entering fsck -y, but do this as a last resort because issues may come up during the repair process that you would rather handle manually.

在前两种情况下,你仍然需要在挂载之前修复文件系统,除非你愿意手动检查原始数据。

如果你愿意,你可以选择在所有fsck问题上回答y,输入fsck -y,但这只能作为最后的手段,因为在修复过程中可能会出现你宁愿手动处理的问题。

The debugfs tool allows you to look through the files on a filesystem and copy them elsewhere. By default, it opens filesystems in read-only mode. If you’re recovering data, it’s probably a good idea to keep your files intact to avoid messing things up further.

debugfs工具允许你查看文件系统中的文件并将其复制到其他位置。

默认情况下,它以只读模式打开文件系统。如果你正在恢复数据,最好保持文件的完整性,以避免进一步搞砸事情。

Now, if you’re really desperate, say with a catastrophic disk failure on your hands and no backups, there isn’t a lot you can do other than hope a professional service can “scrape the platters.”

现在,如果你真的很绝望,比如手头出现灾难性的磁盘故障,而又没有备份,除了希望专业服务能够“刮擦盘片”之外,你几乎没有什么办法了。

4.2.12 Special-Purpose Filesystems(特殊用途文件系统)

Not all filesystems represent storage on physical media. Specifically, most versions of Unix have filesystems that serve as system interfaces. That is, rather than serving only as a means to store data on a device, a filesystem can represent system information such as process IDs and kernel diagnostics. This idea goes back to the /dev mechanism, which is an early model of using files for I/O interfaces. The /proc idea came from the eighth edition of research Unix, implemented by Tom J. Killian and accelerated when Bell Labs (including many of the original Unix designers) created Plan 9—a research operating system that took filesystem abstraction to a whole new level (http://plan9.bell-labs.com/sys/doc/9.html).

并非所有的文件系统都代表物理介质上的存储。

具体来说,大多数Unix版本都有作为系统接口的文件系统。

也就是说,文件系统不仅仅作为在设备上存储数据的手段,还可以表示系统信息,如进程ID和内核诊断。

这个想法可以追溯到/dev机制,它是使用文件进行I/O接口的早期模型。

/proc的想法来自于研究Unix的第八版,由Tom J. Killian实现,并在贝尔实验室(包括许多原始Unix设计师)创建Plan 9时加速发展——Plan 9是一个将文件系统抽象提升到一个全新水平的研究操作系统(http://plan9.bell-labs.com/sys/doc/9.html)。

The special filesystem types in common use on Linux include the following:

常用的Linux特殊文件系统类型包括以下几种:

o proc. Mounted on /proc. The name proc is actually an abbreviation for process. Each numbered directory inside /proc is actually the process ID of a current process on the system; the files in those directories represent various aspects of the processes. The file /proc/self represents the current process. The Linux proc filesystem includes a great deal of additional kernel and hardware information in files like /proc/cpuinfo. (There has been a push to move information unrelated to processes out of /proc and into /sys.)

o proc。挂载在/proc上。proc实际上是process的缩写。

/proc内的每个编号目录实际上是系统上当前进程的进程ID;这些目录中的文件表示进程的各个方面。

文件/proc/self表示当前进程。

Linux的proc文件系统还在像/proc/cpuinfo这样的文件中包含了大量的额外内核和硬件信息。

(有人提议将与进程无关的信息从/proc中移出并放入/sys中。)

o sysfs. Mounted on /sys. (You saw this in Chapter 3.)

o sysfs。挂载在/sys上。(在第三章中已经介绍过。)

o tmpfs. Mounted on /run and other locations. With tmpfs, you can use your physical memory and swap space as temporary storage. For example, you can mount tmpfs where you like, using the size and nr_blocks long options to control the maximum size. However, be careful not to constantly pour things into a tmpfs because your system will eventually run out of memory and programs will start to crash. (For years, Sun Microsystems used a version of tmpfs for /tmp that caused problems on long-running systems.)

o tmpfs。挂载在/run和其他位置上。

使用tmpfs,您可以将物理内存和交换空间用作临时存储。

例如,您可以在需要的位置挂载tmpfs,使用size和nr_blocks选项来控制最大大小。

但是,要小心不要不断往tmpfs中添加内容,因为系统最终会耗尽内存,程序将开始崩溃。

(多年来,Sun Microsystems在长时间运行的系统上使用了一个版本的tmpfs作为/tmp,这在一些系统上会引起问题。)
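Because /proc looks like an ordinary directory tree, ordinary file operations are all you need to read it. As a small illustration of the proc entry above, this sketch lists the numbered directories and reads one field from a process’s status file. The function names are hypothetical, and the proc_root parameter exists only so that you can point the code at a test directory instead of the real /proc.

```python
import os


def list_pids(proc_root="/proc"):
    """Return the process IDs that appear as numbered directories."""
    return sorted(
        int(name)
        for name in os.listdir(proc_root)
        if name.isdigit() and os.path.isdir(os.path.join(proc_root, name))
    )


def read_status_field(pid, field, proc_root="/proc"):
    """Return one field (for example 'Name') from a process's status file."""
    status_path = os.path.join(proc_root, str(pid), "status")
    with open(status_path) as status_file:
        for line in status_file:
            key, _, value = line.partition(":")
            if key == field:
                return value.strip()
    return None


if __name__ == "__main__":
    pids = list_pids()
    print(len(pids), "processes")
    print("this process is named", read_status_field("self", "Name"))
```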

4.3 Swap Space(交换空间)

Not every partition on a disk contains a filesystem. It’s also possible to augment the RAM on a machine with disk space. If you run out of real memory, the Linux virtual memory system can automatically move pieces of memory to and from disk storage. This is called swapping because pieces of idle programs are swapped to the disk in exchange for active pieces residing on the disk. The disk area used to store memory pages is called swap space (or just swap for short).

并非磁盘上的每个分区都包含文件系统。也可以用磁盘空间来增加机器上的内存。

如果实际内存不足,Linux 虚拟内存系统会自动将内存碎片移入或移出磁盘存储空间。

这就是所谓的 "交换"(swapping),因为闲置程序的片段会被交换到磁盘上,以换取磁盘上的活动片段。

用于存储内存页的磁盘区域称为交换空间(简称交换)。

The free command’s output includes the current swap usage in kilobytes as follows:

free 命令的输出包括以千字节为单位的当前交换使用情况,如下所示:

$ free
 total used free
--snip--
Swap: 514072 189804 324268
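If you want those swap numbers in a script, one possible approach is to pick the Swap line out of the free output. This sketch assumes the three-column layout shown above (total, used, free, in kilobytes); parse_swap_line is a hypothetical helper.

```python
# The relevant part of the free output shown above.
SAMPLE_FREE = """\
 total used free
Swap: 514072 189804 324268
"""


def parse_swap_line(free_output):
    """Return swap totals in kilobytes, or None if there is no Swap line."""
    for line in free_output.splitlines():
        if line.startswith("Swap:"):
            total, used, free = (int(value) for value in line.split()[1:4])
            return {"total": total, "used": used, "free": free}
    return None


if __name__ == "__main__":
    swap = parse_swap_line(SAMPLE_FREE)
    print(f"{100.0 * swap['used'] / swap['total']:.0f}% of swap in use")
```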

4.3.1 Using a Disk Partition as Swap Space

To use an entire disk partition as swap, follow these steps:

要将整个磁盘分区用作交换分区,请按以下步骤操作:

  1. Make sure the partition is empty.
  2. Run mkswap dev, where dev is the partition’s device. This command puts a swap signature on the partition.
  3. Execute swapon dev to register the space with the kernel.

  1. 确保分区为空。
  2. 运行 mkswap dev,其中 dev 是分区的设备。该命令会在分区上设置交换签名。
  3. 执行 swapon dev 向内核注册空间。

After creating a swap partition, you can put a new swap entry in your /etc/fstab file to make the system use the swap space as soon as the machine boots. Here is a sample entry that uses /dev/sda5 as a swap partition:

创建交换分区后,你可以在/etc/fstab 文件中添加一个新的交换条目,让系统在机器启动后立即使用交换空间。

下面是一个使用 /dev/sda5 作为交换分区的示例条目:

/dev/sda5 none swap sw 0 0

Keep in mind that many systems now use UUIDs instead of raw device names.

请记住,现在许多系统都使用 UUID 代替原始设备名称。

4.3.2 Using a File as Swap Space(将文件用作交换空间)

You can use a regular file as swap space if you’re in a situation where you would be forced to repartition a disk in order to create a swap partition. You shouldn’t notice any problems when doing this.

如果你处于一种情况下,需要强制重新分区磁盘才能创建交换分区,你可以使用普通文件作为交换空间。

在这样做时,你不应该遇到任何问题。

Use these commands to create an empty file, initialize it as swap, and add it to the swap pool:

使用以下命令创建一个空文件,并将其初始化为交换空间,并将其添加到交换池中:

# dd if=/dev/zero of=swap_file bs=1024k count=num_mb
# mkswap swap_file
# swapon swap_file

Here, swap_file is the name of the new swap file, and num_mb is the desired size, in megabytes.

其中,swap_file 是新交换文件的名称,num_mb 是所需的大小,单位为兆字节。

To remove a swap partition or file from the kernel’s active pool, use the swapoff command.

要从内核的活动池中删除交换分区或文件,请使用 swapoff 命令。

4.3.3 How Much Swap Do You Need?(您需要多少交换?)

At one time, Unix conventional wisdom said you should always reserve at least twice as much swap as you have real memory. Today, not only do the enormous disk and memory capacities available cloud the issue, but so do the ways we use the system. On one hand, disk space is so plentiful that it’s tempting to allocate more than double the memory size. On the other hand, you may never even dip into your swap space because you have so much real memory.

Unix传统智慧曾经认为,你应该保留至少两倍于实际内存大小的交换空间

如今,庞大的磁盘和内存容量不仅使问题变得复杂,我们使用系统的方式也是如此。

一方面,磁盘空间如此丰富,以至于很容易分配超过内存大小两倍的空间。

另一方面,你可能永远都不会使用到交换空间,因为你的实际内存非常充足。

The “double the real memory” rule dated from a time when multiple users would be logged into one machine at a time. Not all of them would be active, though, so it was convenient to be able to swap out the memory of the inactive users when an active user needed more memory.

“实际内存的两倍”规则源自于多个用户同时登录到一台机器的时代。

然而,并不是所有用户都会活跃,因此当一个活跃用户需要更多内存时,交换出非活跃用户的内存是很方便的。

The same may still hold true for a single-user machine. If you’re running many processes, it’s generally fine to swap out parts of inactive processes or even inactive pieces of active processes. However, if you’re constantly using the swap space because many active processes want to use the memory at once, you will suffer serious performance problems because disk I/O is just too slow to keep up with the rest of the system. The only solutions are to buy more memory, terminate some processes, or complain.

对于单用户机器来说,可能依然适用。

如果你运行了很多进程,通常可以将非活跃进程的部分或者甚至活跃进程中的非活跃部分交换出去。

然而,如果你不断使用交换空间,因为许多活跃进程同时需要使用内存,你将遇到严重的性能问题,因为磁盘I/O速度太慢,跟不上系统的其他部分。

唯一的解决办法是购买更多内存,终止一些进程或者进行投诉。
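
To see whether a machine is actually dipping into swap, one option is to read the SwapTotal and SwapFree fields from /proc/meminfo. The following is a minimal sketch; swap_used_kb is a hypothetical helper name, and the optional file argument exists only so the function can be tried on saved data.

```shell
# Print the amount of swap in use, in kB, from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
swap_used_kb() {
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free  = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}

# Only query the live system if /proc/meminfo is available.
[ -r /proc/meminfo ] && swap_used_kb
```

A single reading says little on its own; a value that keeps growing while the system feels sluggish is the pattern the paragraph above warns about.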

Sometimes, the Linux kernel may choose to swap out a process in favor of a little more disk cache. To prevent this behavior, some administrators configure certain systems with no swap space at all. For example, high-performance network servers should never dip into swap space and should avoid disk access if at all possible.

有时,Linux内核可能会选择交换出一个进程,以获取更多的磁盘缓存。

为了防止这种行为,一些管理员会配置某些系统根本不使用交换空间。

例如,高性能网络服务器绝不应该使用交换空间,并尽可能避免磁盘访问。

NOTE It’s dangerous to do this on a general-purpose machine. If a machine completely runs out of both real memory and swap space, the Linux kernel invokes the out-of-memory (OOM) killer to kill a process in order to free up some memory. You obviously don’t want this to happen to your desktop applications. On the other hand, high-performance servers include sophisticated monitoring and load-balancing systems to ensure that they never reach the danger zone.

注意:在通用目的的机器上这样做是危险的。

如果一台机器完全用尽了实际内存和交换空间,Linux内核将调用OOM(内存不足)杀手来终止一个进程以释放一些内存。

显然,你不希望这种情况发生在你的桌面应用程序上。

另一方面,高性能服务器包括复杂的监控和负载平衡系统,以确保它们永远不会达到危险区域。

You’ll learn much more about how the memory system works in Chapter 8.

第 8 章将详细介绍内存系统的工作原理。

4.4 Looking Forward: Disks and User Space(展望未来: 磁盘和用户空间)

In disk-related components on a Unix system, the boundaries between user space and the kernel can be difficult to characterize. As you’ve seen, the kernel handles raw block I/O from the devices, and user-space tools can use the block I/O through device files. However, user space typically uses the block I/O only for initializing operations such as partitioning, filesystem creation, and swap space creation. In normal use, user space uses only the filesystem support that the kernel provides on top of the block I/O. Similarly, the kernel also handles most of the tedious details when dealing with swap space in the virtual memory system.

在Unix系统的与磁盘相关的组件中,用户空间和内核之间的边界可能很难界定。

正如你所见,内核处理来自设备的原始块I/O,而用户空间工具可以通过设备文件使用块I/O。

然而,用户空间通常只在初始化操作(如分区、文件系统创建和交换空间创建)中使用块I/O。

在正常使用中,用户空间仅使用内核在块I/O之上提供的文件系统支持。

同样,内核在处理虚拟内存系统中的交换空间时也处理了大部分繁琐的细节。

The remainder of this chapter briefly looks at the innards of a Linux filesystem. This is more advanced material, and you certainly don’t need to know it to proceed with the book. If this is your first time through, skip to the next chapter and start learning about how Linux boots.

本章的其余部分简要介绍了Linux文件系统的内部结构。这是更高级的内容,你当然不需要了解它来继续阅读本书。

如果这是你第一次阅读,请跳到下一章并开始学习Linux的启动过程。

4.5 Inside a Traditional Filesystem(传统文件系统内部)

A traditional Unix filesystem has two primary components: a pool of data blocks where you can store data and a database system that manages the data pool. The database is centered around the inode data structure. An inode is a set of data that describes a particular file, including its type, permissions, and—perhaps most importantly—where in the data pool the file data resides. Inodes are identified by numbers listed in an inode table.

传统的Unix文件系统由两个主要组件组成:一个数据块池,用于存储数据,以及一个管理数据池的数据库系统。数据库围绕inode数据结构展开。inode是一组数据,描述了特定文件的信息,包括文件类型、权限,以及最重要的是文件数据在数据池中的位置。inode通过在inode表中列出的编号进行标识。

Filenames and directories are also implemented as inodes. A directory inode contains a list of filenames and corresponding links to other inodes.

文件名和目录也被实现为inode。目录inode包含文件名的列表,并对应链接到其他inode。

To provide a real-life example, I created a new filesystem, mounted it, and changed the directory to the mount point. Then, I added some files and directories with these commands (feel free to do this yourself with a flash drive):

为了提供一个现实生活中的例子,我创建了一个新的文件系统,挂载它,并将目录更改为挂载点。然后,我使用以下命令添加了一些文件和目录(可以随意使用闪存驱动器自己尝试):

$ mkdir dir_1
$ mkdir dir_2
$ echo a > dir_1/file_1
$ echo b > dir_1/file_2
$ echo c > dir_1/file_3
$ echo d > dir_2/file_4
$ ln dir_1/file_3 dir_2/file_5

Note that I created dir_2/file_5 as a hard link to dir_1/file_3, meaning that these two filenames actually represent the same file. (More on this shortly.)

请注意,我创建了 dir_2/file_5 作为 dir_1/file_3 的硬链接,这意味着这两个文件名实际上代表的是同一个文件。(稍后会有更多介绍)。
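
If you don't have a spare flash drive, the same commands can be run in a scratch directory on an existing filesystem. The sketch below does that and then checks that the two hard-linked names share one inode; build_example is a hypothetical helper name, and the scratch directory stands in for the freshly created filesystem.

```shell
# Rebuild the example tree under the directory given as $1.
# The commands inside are the ones listed in the text.
build_example() (
    cd "$1" || exit 1
    mkdir dir_1 dir_2
    echo a > dir_1/file_1
    echo b > dir_1/file_2
    echo c > dir_1/file_3
    echo d > dir_2/file_4
    ln dir_1/file_3 dir_2/file_5
)

tmp=$(mktemp -d)
build_example "$tmp"

# Both names should show the same inode number in the first column.
ls -i "$tmp/dir_1/file_3" "$tmp/dir_2/file_5"

# Writing through one name is visible through the other.
echo changed > "$tmp/dir_2/file_5"
cat "$tmp/dir_1/file_3"       # prints: changed
rm -rf "$tmp"
```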

If you were to explore the directories in this filesystem, its contents would appear to the user as shown in Figure 4-4. The actual layout of the filesystem, as shown in Figure 4-5, doesn’t look nearly as clean as the user-level representation.

如果要查看该文件系统中的目录,用户会看到如图 4-4 所示的内容。

文件系统的实际布局如图 4-5 所示,看起来并不像用户级表示那样简洁。

Figure 4-4. User-level representation of a filesystem

图 4-4. 文件系统的用户级表示法

Figure 4-5. Inode structure of the filesystem shown in Figure 4-4

图4-5. 在图4-4中显示的文件系统的inode结构

How do we make sense of this? For any ext2/3/4 filesystem, you start at inode number 2—the root inode. From the inode table in Figure 4-5, you can see that this is a directory inode (dir), so you can follow the arrow over to the data pool, where you see the contents of the root directory: two entries named dir_1 and dir_2 corresponding to inodes 12 and 7633, respectively. To explore those entries, go back to the inode table and look at either of those inodes.

我们如何理解这个结构呢?对于任何一个ext2/3/4文件系统,你从inode编号2开始——根inode。

从图4-5中的inode表可以看出,这是一个目录inode(dir),因此你可以沿着箭头前往数据池,那里显示了根目录的内容:两个名为dir_1和dir_2的条目,分别对应inode 12和inode 7633。

要查看这些条目的内容,返回到inode表中查看其中任意一个inode。

To examine dir_1/file_2 in this filesystem, the kernel does the following:

要在这个文件系统中查看dir_1/file_2,内核执行以下操作:

  1. Determines the path’s components: a directory named dir_1, followed by a component named file_2.
  2. Follows the root inode to its directory data.
  3. Finds the name dir_1 in inode 2’s directory data, which points to inode number 12.
  4. Looks up inode 12 in the inode table and verifies that it is a directory inode.
  5. Follows inode 12’s data link to its directory information (the second box down in the data pool).
  6. Locates the second component of the path (file_2) in inode 12’s directory data. This entry points to inode number 14.
  7. Looks up inode 14 in the inode table. This is a file inode.
  1. 确定路径的组成部分:一个名为dir_1的目录,后面跟着一个名为file_2的组件。
  2. 跟随根inode到达其目录数据。
  3. 在inode 2的目录数据中找到名称为dir_1的条目,该条目指向inode编号12。
  4. 在inode表中查找inode 12并验证其为目录inode。
  5. 跟随inode 12的数据链接到达其目录信息(数据池中的第二个框)。
  6. 在inode 12的目录数据中找到路径的第二个组件(file_2)。该条目指向inode编号14。
  7. 在inode表中查找inode 14。这是一个文件inode。
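
The lookup steps above can be mimicked with a toy model. This is only a sketch of the idea, not how the kernel is implemented: dir_lookup and resolve are hypothetical names, and the table holds only the inode numbers quoted in the text (entries for file_1 and file_4 are left out).

```shell
# Toy directory data keyed by "directory-inode/name".
# Prints the inode number the entry points to, or fails if there is no entry.
dir_lookup() {
    case "$1/$2" in
        2/dir_1)      echo 12 ;;
        2/dir_2)      echo 7633 ;;
        12/file_2)    echo 14 ;;
        12/file_3)    echo 15 ;;
        7633/file_5)  echo 15 ;;
        *)            return 1 ;;
    esac
}

# Walk a relative path one component at a time, starting at root inode 2.
resolve() {
    inode=2
    old_ifs=$IFS
    IFS=/
    set -- $1                 # split the path into its components
    IFS=$old_ifs
    for name in "$@"; do
        inode=$(dir_lookup "$inode" "$name") || return 1
    done
    echo "$inode"
}

resolve dir_1/file_2          # prints: 14
resolve dir_2/file_5          # prints: 15
```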

At this point, the kernel knows the properties of the file and can open it by following inode 14’s data link. This system, of inodes pointing to directory data structures and directory data structures pointing to inodes, allows you to create the filesystem hierarchy that you’re used to. In addition, notice that the directory inodes contain entries for . (the current directory) and .. (the parent directory, except for the root directory). This makes it easy to get a point of reference and to navigate back down the directory structure.

此时,内核已经了解到了该文件的属性,并可以通过跟随inode 14的数据链接来打开它。

这个系统,即inode指向目录数据结构,目录数据结构指向inode,使得你可以创建你熟悉的文件系统层次结构。

此外,注意到目录inode中包含了.(当前目录)和..(父目录,根目录除外)的条目。这使得获取参考点和在目录结构中向下导航变得容易。

4.5.1 Viewing Inode Details(查看 Inode 详细信息)

To view the inode numbers for any directory, use the ls -i command. Here’s what you’d get at the root of this example. (For more detailed inode information, use the stat command.)

要查看任何目录的节点编号,请使用 ls -i 命令。

下面是本例中根目录的情况。

(要获得更详细的 inode 信息,请使用 stat 命令)。

$ ls -i
 12 dir_1 7633 dir_2
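
As a small sketch of the stat command mentioned above, the following prints the inode number and hard link count of a file. It assumes the GNU coreutils version of stat, whose -c option takes a format string; inode_info is a hypothetical helper name.

```shell
# Print "inode-number link-count" for a file.
# %i and %h are GNU stat format sequences for the inode number and the
# number of hard links.
inode_info() {
    stat -c '%i %h' "$1"
}

tmp=$(mktemp -d)
echo c > "$tmp/file_3"
ln "$tmp/file_3" "$tmp/file_5"

# Both names should report the same inode number.
inode_info "$tmp/file_3"
inode_info "$tmp/file_5"
rm -rf "$tmp"
```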

Now you’re probably wondering about the link count. You’ve already seen the link count in the output of the common ls -l command, but you likely ignored it. How does the link count relate to the files in Figure 4-5, in particular the “hard-linked” file_5? The link count field is the number of total directory entries (across all directories) that point to an inode. Most of the files have a link count of 1 because they occur only once in the directory entries. This is expected: Most of the time when you create a file, you create a new directory entry and a new inode to go with it. However, inode 15 occurs twice: First it’s created as dir_1/file_3, and then it’s linked to as dir_2/file_5. A hard link is just a manually created entry in a directory to an inode that already exists. The ln command (without the -s option) allows you to manually create new links.

现在你可能对链接计数感到困惑。

你已经在常见的ls -l命令的输出中看到了链接计数,但你可能忽略了它。

链接计数与图4-5中的文件有什么关系，特别是“硬链接”的 file_5？

链接计数字段是指指向一个索引节点的所有目录项(跨所有目录)的总数。

大多数文件的链接计数为1,因为它们在目录项中只出现一次。

这是预期的：大多数情况下，当你创建一个文件时，你会创建一个新的目录项和一个与之对应的新索引节点。

然而,索引节点15出现了两次:首先它被创建为dir_1/file_3,然后它被链接为dir_2/file_5。

硬链接只是一个手动创建的目录项,指向已经存在的索引节点。

ln命令(不带-s选项)允许你手动创建新的链接。

This is also why removing a file is sometimes called unlinking. If you run rm dir_1/file_2, the kernel searches for an entry named file_2 in inode 12’s directory entries. Upon finding that file_2 corresponds to inode 14, the kernel removes the directory entry and then subtracts 1 from inode 14’s link count. As a result, inode 14’s link count will be 0, and the kernel will know that there are no longer any names linking to the inode. Therefore, it can now delete the inode and any data associated with it.

这也是为什么删除文件有时被称为取消链接。

如果你运行rm dir_1/file_2,内核会在索引节点12的目录项中搜索名为file_2的条目。

当找到file_2对应的是索引节点14时,内核会删除目录项,然后从索引节点14的链接计数中减去1。

结果,索引节点14的链接计数将变为0,内核将知道不再有任何名称链接到该索引节点。

因此,它现在可以删除索引节点和与之关联的任何数据。

However, if you run rm dir_1/file_3, the end result is that the link count of inode 15 goes from 2 to 1 (because dir_2/file_5 still points there), and the kernel knows not to remove the inode.

然而,如果你运行rm dir_1/file_3,最终的结果是索引节点15的链接计数从2变为1(因为dir_2/file_5仍然指向那里),内核知道不要删除该索引节点。
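
The effect of unlinking is easy to observe on a scratch copy of the hard-linked pair. In this sketch, link_count is a hypothetical helper that reads the second field of ls -l output, which is where the link count appears.

```shell
# Print the link count of a file (second field of ls -l).
link_count() {
    ls -ld "$1" | awk '{ print $2 }'
}

tmp=$(mktemp -d)
echo c > "$tmp/file_3"
ln "$tmp/file_3" "$tmp/file_5"
link_count "$tmp/file_5"      # prints: 2 (two names point at the inode)

rm "$tmp/file_3"              # removes one directory entry only
link_count "$tmp/file_5"      # prints: 1
cat "$tmp/file_5"             # prints: c (the inode and its data remain)
rm -rf "$tmp"
```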

Link counts work much the same for directories. Observe that inode 12’s link count is 2, because there are two inode links there: one for dir_1 in the directory entries for inode 2 and the second a self-reference (.) in its own directory entries. If you create a new directory dir_1/dir_3, the link count for inode 12 would go to 3 because the new directory would include a parent (..) entry that links back to inode 12, much as inode 12’s parent link points to inode 2.

目录的链接计数工作方式与此类似。

观察到索引节点12的链接计数为2，因为有两个链接指向它：一个是索引节点2的目录项中的dir_1，另一个是它自己目录项中的自引用（.）。

如果你创建一个新的目录dir_1/dir_3,索引节点12的链接计数将变为3,因为新的目录将包括一个指向索引节点12的父目录(..)条目,就像索引节点12的父链接指向索引节点2一样。
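
You can watch a directory's link count change in the same way. This sketch assumes a traditional filesystem such as ext2/3/4; some other filesystems (Btrfs, for example) do not track directory link counts like this and may always report 1, so the numbers in the comments are what you would typically see, not a guarantee. link_count is a hypothetical helper name.

```shell
# Print the link count of a file or directory (second field of ls -l).
link_count() {
    ls -ld "$1" | awk '{ print $2 }'
}

tmp=$(mktemp -d)
mkdir "$tmp/dir_1"
link_count "$tmp/dir_1"       # typically 2: the parent's entry plus "."

mkdir "$tmp/dir_1/dir_3"
link_count "$tmp/dir_1"       # typically 3: dir_3's ".." now links back
rm -rf "$tmp"
```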

There is one small exception. The root inode 2 has a link count of 4. However, Figure 4-5 shows only three directory entry links. The “fourth” link is in the filesystem’s superblock because the superblock tells you where to find the root inode.

有一个小例外。根索引节点2的链接计数为4。

然而,图4-5只显示了三个目录项链接。

第“四”个链接在文件系统的超级块中,因为超级块告诉你如何找到根索引节点。

Don’t be afraid to experiment on your system. Creating a directory structure and then using ls -i or stat to walk through the pieces is harmless. You don’t need to be root (unless you mount and create a new filesystem).

不要害怕在你的系统上进行实验。创建一个目录结构,然后使用ls -i或stat来遍历各个部分是无害的。

你不需要成为root用户(除非你挂载和创建一个新的文件系统)。

But there’s still one piece missing: When allocating data pool blocks for a new file, how does the filesystem know which blocks are in use and which are available? One of the most basic ways is with an additional management data structure called a block bitmap. In this scheme, the filesystem reserves a series of bytes, with each bit corresponding to one block in the data pool. A value of 0 means that the block is free, and a 1 means that it’s in use. Thus, allocating and deallocating blocks is a matter of flipping bits.

但还有一个缺失的部分:当为新文件分配数据池块时,文件系统如何知道哪些块正在使用,哪些块可用?

其中最基本的方法之一是使用一个额外的管理数据结构,称为块位图。

在这种方案中,文件系统保留一系列字节,每个位对应数据池中的一个块。

值为0表示该块空闲,值为1表示该块正在使用。因此,分配和释放块只是翻转位。
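
The bit flipping can be sketched with shell arithmetic. This is a toy model with a made-up pool of 16 blocks held in one integer; the function names are hypothetical, and a real filesystem stores its bitmap on disk and uses far more elaborate allocation policies.

```shell
bitmap=0          # one bit per block; 0 = free, 1 = in use
NBLOCKS=16        # size of the toy data pool

block_in_use() { [ $(( (bitmap >> $1) & 1 )) -eq 1 ]; }
mark_used()    { bitmap=$(( bitmap | (1 << $1) )); }
mark_free()    { bitmap=$(( bitmap & ~(1 << $1) )); }

# Find the lowest-numbered free block, mark it used, and store its number
# in the variable "allocated". Fails if the pool is full.
allocate() {
    n=0
    while [ "$n" -lt "$NBLOCKS" ]; do
        if ! block_in_use "$n"; then
            mark_used "$n"
            allocated=$n
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

allocate; echo "allocated block $allocated"   # block 0
allocate; echo "allocated block $allocated"   # block 1
mark_free 0                                   # deallocate block 0
allocate; echo "allocated block $allocated"   # block 0 is reused
```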

Problems in a filesystem arise when the inode table data doesn’t match the block allocation data or when the link counts are incorrect; this can happen when you don’t cleanly shut down a system. Therefore, when you check a filesystem, as described in 4.2.11 Checking and Repairing Filesystems, the fsck program walks through the inode table and directory structure to generate new link counts and a new block allocation map (such as the block bitmap), and then it compares the newly generated data with the filesystem on the disk. If there are mismatches, fsck must fix the link counts and determine what to do with any inodes and/or data that didn’t come up when it traversed the directory structure. Most fsck programs make these “orphans” new files in the filesystem’s lost+found directory.

文件系统出现问题的原因是索引节点表数据与块分配数据不匹配,或链接计数不正确;当你不干净地关闭系统时,这种情况可能发生。

因此,在检查文件系统时,如4.2.11节“检查和修复文件系统”所述,fsck程序会遍历索引节点表和目录结构以生成新的链接计数和新的块分配图(如块位图),然后将新生成的数据与磁盘上的文件系统进行比较。

如果存在不匹配,fsck必须修复链接计数,并确定在遍历目录结构时未出现的任何索引节点和/或数据的处理方式。

大多数fsck程序将这些“孤立文件”作为文件系统的lost+found目录中的新文件。
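
The link-count part of that check can be illustrated with a toy model. This is not how any real fsck program is written; check_links is a hypothetical helper, the two input files use made-up text formats, and inode 20 is an invented orphan added for the demonstration.

```shell
# $1: directory entries, one per line: "dir_inode name target_inode"
# $2: inode table, one per line:       "inode recorded_link_count"
# Recount the references to each inode and report any disagreement.
check_links() {
    awk '
        NR == FNR { seen[$3]++; next }      # first file: count references
        {
            found = ($1 in seen) ? seen[$1] : 0
            if (found == 0)
                print "inode " $1 ": orphan, reconnect under lost+found"
            else if (found != $2)
                print "inode " $1 ": link count " $2 " should be " found
        }
    ' "$1" "$2"
}

entries=$(mktemp)
table=$(mktemp)
cat > "$entries" <<'EOF'
12 file_2 14
12 file_3 15
7633 file_5 15
EOF
cat > "$table" <<'EOF'
14 1
15 1
20 1
EOF

# Inode 15 has a stale count of 1, and inode 20 has no directory entry.
check_links "$entries" "$table"
rm -f "$entries" "$table"
```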

4.5.2 Working with Filesystems in User Space(在用户空间中使用文件系统)

When working with files and directories in user space, you shouldn’t have to worry much about the implementation going on below them. You’re expected to access the contents of files and directories of a mounted filesystem through kernel system calls. Curiously, though, you do have access to certain filesystem information that doesn’t seem to fit in user space; in particular, the stat() system call returns inode numbers and link counts.

在用户空间中处理文件和目录时,你不需要过于关注它们下面的实现细节。

你可以通过内核系统调用来访问已挂载文件系统中的文件和目录内容。

然而,有趣的是,你可以访问某些似乎不适合于用户空间的文件系统信息,特别是stat()系统调用返回的inode号和链接计数。

When not maintaining a filesystem, do you have to worry about inode numbers and link counts? Generally, no. This stuff is accessible to user mode programs primarily for backward compatibility. Furthermore, not all filesystems available in Linux have these filesystem internals. The Virtual File System (VFS) interface layer ensures that system calls always return inode numbers and link counts, but those numbers may not necessarily mean anything.

当不维护文件系统时,你需要关注inode号和链接计数吗?一般来说,不需要。

这些信息对于用户模式程序主要是为了向后兼容性而提供的。

此外,Linux中并非所有可用的文件系统都具有这些文件系统内部结构。

虚拟文件系统(VFS)接口层确保系统调用始终返回inode号和链接计数,但这些数字可能并不一定有实际意义。

You may not be able to perform traditional Unix filesystem operations on nontraditional filesystems. For example, you can’t use ln to create a hard link on a mounted VFAT filesystem because the directory entry structure is entirely different.

在非传统文件系统上可能无法执行传统的Unix文件系统操作。

例如,在已挂载VFAT文件系统上无法使用ln命令创建硬链接,因为目录条目结构完全不同。

Fortunately, the system calls available to user space on Unix/Linux systems provide enough abstraction for painless file access—you don’t need to know anything about the underlying implementation in order to access files. In addition, filenames are flexible in format and mixed-case names are supported, making it easy to support other hierarchical-style filesystems.

幸运的是,Unix/Linux系统中提供给用户空间的系统调用提供了足够的抽象,以便无痛访问文件-你不需要了解底层实现的任何信息。

此外,文件名的格式灵活,支持大小写混合命名,这使得支持其他分层式文件系统变得容易。

Remember, specific filesystem support does not necessarily need to be in the kernel. In user-space filesystems, the kernel only needs to act as a conduit for system calls.

请记住,特定的文件系统支持不一定需要在内核中。

在用户空间文件系统中,内核只需充当系统调用的传输通道。

4.5.3 The Evolution of Filesystems(文件系统的演变)

As you can see, even the simple filesystem just described has many different components to maintain. At the same time, the demands placed on filesystems continuously increase with new tasks, technology, and storage capacity. Today’s performance, data integrity, and security requirements are beyond the offerings of older filesystem implementations, so filesystem technology is constantly changing. We’ve already mentioned Btrfs as an example of a next-generation filesystem (see 4.2.1 Filesystem Types).

正如你所见,即使是刚刚描述的简单文件系统也有许多不同的组件需要维护。

与此同时,随着新任务、技术和存储容量的增加,对文件系统的需求也在不断增加。

如今的性能、数据完整性和安全性要求已超出了旧文件系统实现的能力,因此文件系统技术在不断变化。

我们已经提到Btrfs作为下一代文件系统的一个示例(见4.2.1 文件系统类型)。

One example of how filesystems are changing is that new filesystems use separate data structures to represent directories and filenames, rather than the directory inodes described here. They reference data blocks differently. Also, filesystems that optimize for SSDs are still evolving. Continuous change in the development of filesystems is the norm, but keep in mind that the evolution of filesystems doesn’t change their purpose.

文件系统的变化之一是,新的文件系统使用单独的数据结构来表示目录和文件名,而不是这里描述的目录inode。它们以不同的方式引用数据块。

此外,为SSD进行优化的文件系统仍在不断演化。

文件系统的开发中持续变化是常态,但请记住,文件系统的演化不会改变它们的目的。


Xander