The data and metadata of CephFS are separated into different pools, and the underlying data pools of RGW, CephFS, and RBD are all built on RADOS (Reliable Autonomic Distributed Object Store), the foundation of a Ceph storage cluster. Everything in the RADOS layer is stored as objects, regardless of whether the upper layer presents a file, an object, or a block.
This article mainly discusses how CephFS files are distributed across the data pool, using the CephFS kernel client as the example for analysis.
Take the following file as an example. First find the file's inode number, 1099511627776, and convert it to hexadecimal: 10000000000. The file's objects in the data pool are named after this inode number plus an object number (the data offset divided by object_size, which defaults to 4 MiB; numbering starts from 0).
How a CephFS read is converted into an OSD request
struct ceph_file_layout {
/* file -> object mapping */
u32 stripe_unit; /* stripe unit, in bytes */
u32 stripe_count; /* over this many objects */
u32 object_size; /* until objects are this big */
s64 pool_id; /* rados pool id */
struct ceph_string __rcu *pool_ns; /* rados pool namespace */
};
The file's layout attribute, stored in each file's inode, provides the parameters used to compute the file's actual object distribution.
On the client (the Linux kernel client), the layout is exposed through the file's extended attributes (xattrs) and is modified through the setxattr path (the client entry function is __ceph_setxattr; the MDS server entry function is Server::handle_client_setxattr).
You can view the default value with the getfattr command; the command and its output are as follows:
getfattr -n ceph.file.layout ceph-debuginfo-Lakestor_v2.1.0.18-0.el7.x86_64.rpm
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs-data"
The read process sends a request from the Linux kernel client to the OSD server. The entry function is:
|__ceph_read_iter
|____ceph_get_caps (gets the file's caps; if the caps are insufficient, sends a getattr request to the MDS server for the latest inode metadata; when the server finishes processing and replies, the client's handle_reply ultimately calls fill_inode to fill the latest metadata into the client's inode cache)
|______ceph_direct_read_write (if the client cache has no data for the corresponding range, sends a request to the OSD to fetch it)
|__ceph_direct_read_write:
|____ceph_osdc_new_request (calc_layout -> ceph_calc_file_object_mapping; builds the object name via ceph_oid_printf(&req->r_base_oid, "%llx.%08llx", vino.ino, objnum);)
|________ceph_osdc_start_request
...
|__ceph_osdc_new_request:
|____calc_layout(layout, off, plen, &objnum, &objoff, &objlen):
|______ceph_calc_file_object_mapping
Important input parameters: (1) the layout structure in the inode, (2) the offset off to read or write, and (3) the length plen to read or write.
Output parameters: (1) the object number objnum, (2) the offset objoff within that object, and (3) the length objlen within that object.
|________ceph_oid_printf(&req->r_base_oid, "%llx.%08llx", vino.ino, objnum); (the inode number and the object number are spliced together to form the name of the object to be accessed)