First, let's clarify the requirement: use C to determine whether a file is a text file, a binary file, or a file in some compressed format.
File types
On Linux, everything is a file.
To manage everything as files, Linux divides files into seven types, as follows:
Type | Shorthand | S_IFMT value | st_mode test macro | Description |
---|---|---|---|---|
block device | b | S_IFBLK | S_ISBLK(m) | Device accessed in fixed-size blocks, such as a hard disk |
character device | c | S_IFCHR | S_ISCHR(m) | Device accessed as a byte stream, such as a keyboard, mouse, printer, or tty terminal |
directory | d | S_IFDIR | S_ISDIR(m) | Folder |
symbolic link | l | S_IFLNK | S_ISLNK(m) | Symbolic (soft) link; a hard link is not a separate file type |
socket | s | S_IFSOCK | S_ISSOCK(m) | Used for inter-process and network communication |
regular file | - | S_IFREG | S_ISREG(m) | Divided into plain text files and binary files |
named pipe | p | S_IFIFO | S_ISFIFO(m) | Named pipe (FIFO) file |
The third and fourth columns in the table above are macros that Linux provides for use with the `stat` function to determine the file type. To determine whether a file is a regular file, you can use the following code:
stat(pathname, &sb);
if ((sb.st_mode & S_IFMT) == S_IFREG) {
/* Handle regular file */
}
Or use the test macro directly:
stat(pathname, &sb);
if (S_ISREG(sb.st_mode)) {
/* Handle regular file */
}
But our requirement is to determine whether a file is a text file or a binary file. Both are regular files (S_IFREG), so the method above cannot tell them apart.
The general-purpose file command
`file` is a built-in Linux command for detecting file types.
Roughly speaking, it reads the first 1024 bytes of a file, parses the file header according to the magic rules (`/etc/magic` or `/usr/share/misc/magic`), and prints the result to the screen.
It is also very simple to use: just type `file` followed by the file name:
[root@ck08 ~]# file anaconda-ks.cfg
anaconda-ks.cfg: ASCII text
[root@ck08 ~]# file tls.pcap
tls.pcap: tcpdump capture file (little-endian) - version 2.4 (Ethernet, capture length 262144)
[root@ck08 ~]# file zlib-1.2.11.tar.gz
zlib-1.2.11.tar.gz: gzip compressed data, was "zlib-1.2.11.tar", from Unix, last modified: Mon Jan 16 01:36:58 2017, max compression
[root@ck08 ~]# file /usr/bin/grep
/usr/bin/grep: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=bb5d89868c5a04ae48f76250559cb01fae1cd762, stripped
From the examples above, we can see that the `file` command is very powerful: it can identify the detailed type of almost any file, and even specifics such as encoding, compression format, and size.
Therefore, this command fits our needs.
However, what we need is a C implementation, so we have to study the magic file header rules.
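As a first step toward a C version, reading the leading bytes of a file (the 1024-byte header mentioned above) might look roughly like this. The helper name `read_file_header` is an assumption for illustration and is not part of `file` or libmagic.

```c
#include <stdio.h>

/* Hypothetical helper: read up to cap bytes from the start of the file
 * into buf. Returns the number of bytes actually read, which is 0 if
 * the file cannot be opened or is empty. */
static size_t read_file_header(const char *path, unsigned char *buf, size_t cap) {
    FILE *fp = fopen(path, "rb");   /* binary mode: no newline translation */
    size_t n;

    if (fp == NULL) {
        return 0;
    }
    n = fread(buf, 1, cap, fp);     /* short count if the file is smaller than cap */
    fclose(fp);
    return n;
}
```

A caller would pass a 1024-byte buffer and then test only the first `n` bytes it got back. `fread` is used because, unlike line-oriented functions, it does not stop at newlines or zero bytes.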
magic file rules
Each line in the file specifies one test rule for verifying the file type, and each rule consists of 4 fields: offset, type, test, and message.
offset
- The byte offset, counted from the beginning of the file, at which to start checking.

type
- The data type to check, that is, the type of the data that starts at offset. For the full list of data types, see magic(5). Commonly used types are:
  - byte: a one-byte value
  - short: a two-byte value
  - long: a four-byte value
  - string: a string

test
- The test value, used to check whether the data of the given type at offset equals it. Write it using the C language's numeric or character notation.

message
- The message to display when the check succeeds.
If type is a numeric type, it can be followed by &value, which means the value read from the file is ANDed with value before it is compared with the test value. If type is a string type, it can be followed by /[Bbc]*, where /b means ignore whitespace and /c means ignore case.
If test is a numeric value, it can be prefixed with =, <, >, &, ^, or ~, which respectively mean equal to, less than, greater than, bitwise AND, bitwise XOR, and bitwise negation.
If test is a string, it can be prefixed with =, <, or >.
For example, the magic representation of an ELF file is:

    # ELF
    #0   string  ELF       ELF
    0    string  \177ELF   ELF
    >4   byte    1         32-bit
    >4   byte    2         64-bit
    >5   byte    1         LSB
    >5   byte    2         MSB
    >16  short   0         unknown type
    >16  short   1         relocatable
    >16  short   2         executable
    >16  short   3         dynamic lib
    >16  short   4         core file
    >18  short   0         unknown machine
    >18  short   1         WE32100
    >18  short   2         SPARC
    >18  short   3         80386
    >18  short   4         M68000
    >18  short   5         M88000
    >20  long    1         Version 1
    >36  long    1         MathCoPro/FPU/MAU Required
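To show how such rules are applied, here is a small sketch that evaluates the first few ELF rules against a byte buffer. It is a simplification and not how libmagic is implemented: the struct layout and the names `magic_rule`, `rule_matches`, and `describe_elf` are assumptions, and only the byte and string rules are modelled, so byte order does not come into play.

```c
#include <stddef.h>
#include <string.h>

/* One simplified rule with the four fields described above. */
enum rule_type { RULE_BYTE, RULE_STRING };

struct magic_rule {
    size_t offset;          /* where to start checking */
    enum rule_type type;    /* data type at that offset (informational here) */
    const char *test;       /* expected bytes (one byte for RULE_BYTE) */
    size_t test_len;        /* length of test in bytes */
    const char *message;    /* text to report on a match */
};

/* The string rule and the byte rules from the ELF listing. */
static const struct magic_rule elf_rules[] = {
    {0, RULE_STRING, "\177ELF", 4, "ELF"},
    {4, RULE_BYTE,   "\001",    1, "32-bit"},
    {4, RULE_BYTE,   "\002",    1, "64-bit"},
    {5, RULE_BYTE,   "\001",    1, "LSB"},
    {5, RULE_BYTE,   "\002",    1, "MSB"},
};

/* Return 1 if the rule matches within the first len bytes of buf. */
static int rule_matches(const struct magic_rule *r,
                        const unsigned char *buf, size_t len) {
    if (r->offset > len || r->test_len > len - r->offset) {
        return 0;   /* the tested range lies outside the buffer */
    }
    return memcmp(buf + r->offset, r->test, r->test_len) == 0;
}

/* Join the messages of all matching rules into out, separated by spaces.
 * The first rule is the top-level test; the ">" rules are only tried
 * when it matches. Returns 1 if buf starts with the ELF header. */
static int describe_elf(const unsigned char *buf, size_t len,
                        char *out, size_t out_cap) {
    size_t i;
    size_t n = sizeof(elf_rules) / sizeof(elf_rules[0]);

    if (out_cap == 0) {
        return 0;
    }
    out[0] = '\0';
    if (!rule_matches(&elf_rules[0], buf, len)) {
        return 0;
    }
    for (i = 0; i < n; i++) {
        if (rule_matches(&elf_rules[i], buf, len)) {
            size_t used = strlen(out);
            size_t need = strlen(elf_rules[i].message) + (used ? 1 : 0);
            if (used + need + 1 > out_cap) {
                break;  /* out is full; stop appending */
            }
            if (used) {
                strcat(out, " ");
            }
            strcat(out, elf_rules[i].message);
        }
    }
    return 1;
}
```

For a buffer that begins with the bytes 7F 45 4C 46 02 01, this sketch produces the text "ELF 64-bit LSB", which matches the start of what `file` printed for /usr/bin/grep earlier.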
The magic format is really not something humans can read easily. Linux has the libmagic library for parsing magic files, but I tried it on CentOS 7 and Ubuntu 20.04 and did not manage to build a program with it (gcc reported that it could not find magic.h; that header typically comes from a separate development package). Besides, my requirement is a more general method that works not only on Linux but also behaves well on Windows and AIX, so trying to reimplement a rule set like the one `file` uses is not practical.
So, is there a more general way to determine the file type?
Many sources on the Internet say that you can judge by the bytes in the file: if the file contains \x00, it must be a binary or compressed file; otherwise it is a plain text file.
Most of the time, this rule holds. But if the text file is encoded in UTF-16 or UTF-32, you are out of luck, because in those encodings even ordinary ASCII characters are stored with zero bytes.
Therefore, this scheme is not reliable.
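The failure is easy to reproduce. The sketch below implements the zero-byte rule in a made-up helper, `looks_binary_by_nul`, and applies it to the same three characters stored once as ASCII and once as UTF-16LE with a byte-order mark.

```c
#include <stddef.h>
#include <string.h>

/* The heuristic described above: a buffer "looks binary" if it
 * contains at least one 0x00 byte. Returns 1 or 0. */
static int looks_binary_by_nul(const unsigned char *buf, size_t len) {
    return memchr(buf, 0x00, len) != NULL;
}

/* "hi\n" as ASCII (also valid UTF-8) ... */
static const unsigned char text_ascii[] = { 'h', 'i', '\n' };

/* ... and the same text as UTF-16LE: a byte-order mark (FF FE),
 * then two bytes per character, the high byte being 0x00. */
static const unsigned char text_utf16le[] = {
    0xFF, 0xFE, 'h', 0x00, 'i', 0x00, '\n', 0x00
};
```

`looks_binary_by_nul(text_ascii, sizeof(text_ascii))` returns 0, but the call on `text_utf16le` returns 1, so a perfectly ordinary UTF-16 text file would be misclassified as binary.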
Special file headers
The idea behind libmagic, put bluntly, is to judge a file by the bytes in its header. In other words, if we know the header bytes of certain special file types, then a file whose header matches one of them is a special file, and otherwise it is a plain text file. Following this idea, we can get an effect similar to that of the libmagic library for the file types we care about.
An article on the standard header encodings of various file types lists some common file header signatures. jar packages and zip archives start with 504B0304 (rar archives have their own signature, 52617221). Binary files on Linux, including .o and .so files as well as core dump files, are ELF files, whose header is 7F454C46 (.a static libraries are ar archives and start with the text !<arch> instead). Windows executables usually start with 4D5A9000. AIX is more complicated, but the first three bytes of its executables are basically 01DF00. Based on these signatures, many distinctions can already be made.
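One way to keep such signatures manageable is a small lookup table. The sketch below is only an illustration: the struct, the labels, and the function name `match_signature` are assumptions, and the Windows entry assumes the 4D5A9000 signature that the code later in this article also checks for.

```c
#include <stddef.h>
#include <string.h>

/* A known file header: raw bytes, their length, and a label. */
struct file_signature {
    const char *bytes;
    size_t len;
    const char *label;
};

static const struct file_signature known_signatures[] = {
    {"\x50\x4B\x03\x04", 4, "zip/jar archive"},
    {"\x7F\x45\x4C\x46", 4, "ELF binary"},
    {"\x4D\x5A\x90\x00", 4, "Windows executable"},
    {"\x01\xDF\x00",     3, "AIX executable"},
};

/* Return the label of the first signature that matches the start of
 * buf, or NULL if none matches (the caller may then treat the file as
 * plain text). */
static const char *match_signature(const unsigned char *buf, size_t len) {
    size_t i;
    size_t n = sizeof(known_signatures) / sizeof(known_signatures[0]);

    for (i = 0; i < n; i++) {
        const struct file_signature *s = &known_signatures[i];
        if (len >= s->len && memcmp(buf, s->bytes, s->len) == 0) {
            return s->label;
        }
    }
    return NULL;
}
```

Supporting a new format then only takes one more table row. Unlike the platform-specific `#if` branches used later, a table like this checks every signature on every platform, which is a design choice rather than a requirement.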
In fact, on Windows, files can largely be distinguished by their suffix, and the suffix conventions on Unix also identify many files: you would never treat a file with the suffix .rpm as a text file, .o is a binary object file, and .so is a dynamic link library. The ambiguous cases are mostly executables such as ls, grep, and a.out, whose names carry no meaningful suffix.
So our approach is clear, and it has two steps. First, roughly pick out the obvious binary and compressed files by file suffix; then make further distinctions by file header.
C implementation
The code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef int boolean;
#define FALSE 0
#define TRUE 1
/* A list of common file suffixes; extend it as needed, e.g. by loading it from a configuration file */
static const char *with_suffix[] = {".gz", ".rar", ".exe", ".bz2",
".tar", ".xz", ".Z", ".rpm", ".zip",
".a", ".so", ".o", ".jar", ".dll",
".lib", ".deb", ".I", ".png",".jpg",
".mp3", ".mp4", ".m4a", ".flv", ".mkv",
".rmvb", ".avi", ".pcap", ".pdf", ".docx",
".xlsx", ".pptx", ".ram", ".mid", ".dwg",
NULL};
/* Check whether a string ends with the given suffix */
static boolean string_has_suffix(const char *str, const char *suffix) {
    size_t n, m;

    if (str == NULL || suffix == NULL) {
        return FALSE;
    }
    n = strlen(str);
    m = strlen(suffix);
    if (n < m) {
        return FALSE;
    }
    /* Compare the last m characters of str with suffix */
    return memcmp(str + n - m, suffix, m) == 0 ? TRUE : FALSE;
}
/* Check whether the file name has one of the special suffixes */
static boolean file_has_spec_suffix(const char *fname) {
    const char **suffix = with_suffix;

    while (*suffix) {
        if (string_has_suffix(fname, *suffix)) {
            return TRUE;
        }
        suffix++;
    }
    return FALSE;
}
/* Check whether the file starts with a known special header */
static boolean file_has_spec_header(const char *fname) {
    FILE *fp = NULL;
    size_t len = 0;
    unsigned char buf[16] = {0};

    /* Open in binary mode so that Windows does not translate line endings */
    if ((fp = fopen(fname, "rb")) == NULL) {
        return FALSE;
    }
    /* fread, unlike fgets, does not stop at newlines or zero bytes */
    len = fread(buf, 1, sizeof(buf), fp);
    fclose(fp);
    if (len < 4) {
        return FALSE;
    }
#if defined(__linux__)
    /* ELF header */
    if (memcmp(buf, "\x7F\x45\x4C\x46", 4) == 0) {
        return TRUE;
    }
#elif defined(_AIX)
    /* executable binary */
    if (memcmp(buf, "\x01\xDF\x00", 3) == 0) {
        return TRUE;
    }
#elif defined(_WIN32) || defined(WIN32)
    /* standard exe file; normally the suffix check catches it first */
    if (memcmp(buf, "\x4D\x5A\x90\x00", 4) == 0) {
        return TRUE;
    }
#endif
    if (memcmp(buf, "\x50\x4B\x03\x04", 4) == 0) {
        /* possibly a ZIP-based archive, e.g. jar, zip, etc. */
        return TRUE;
    }
    return FALSE;
}
/* Test program:
 * takes a file name on the command line and reports the file's type
 */
int main(int argc, const char **argv) {
    if (argc < 2) {
        printf("usage: need target file\n");
        return EXIT_FAILURE;
    }
    const char *fname = argv[1];
    if (file_has_spec_suffix(fname)) {
        printf("file %s have special suffix, maybe it's a binary or archive file\n", fname);
    } else if (file_has_spec_header(fname)) {
        printf("file %s have special header, maybe it's a binary or archive file\n", fname);
    } else {
        printf("file %s should be a text file\n", fname);
    }
    return 0;
}
The results are as follows; you can compare them with the output of the `file` command shown earlier:
[root@ck08 ctest]# gcc -o magic magic.c
[root@ck08 ctest]# ./magic ~/anaconda-ks.cfg
file /root/anaconda-ks.cfg should be a text file
[root@ck08 ctest]# ./magic ~/tls.pcap
file /root/tls.pcap have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic ~/zlib-1.2.11.tar.gz
file /root/zlib-1.2.11.tar.gz have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic /usr/bin/grep
file /usr/bin/grep have special header, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic kafka_2.12-2.8.0.jar
file kafka_2.12-2.8.0.jar have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic kafka_2.12-2.8.0.jar.1
file kafka_2.12-2.8.0.jar.1 have special header, maybe it's a binary or archive file