
First, the requirement: using C, determine whether a file is a text file, a binary file, or a file in some compressed format.

File types

On Linux, everything is a file.
To manage everything through the file abstraction, Linux divides files into seven types:

| Type | Symbol | S_IFMT value | st_mode test macro | Description |
|---|---|---|---|---|
| block device | b | S_IFBLK | S_ISBLK(m) | Devices accessed in blocks, such as hard disks |
| character device | c | S_IFCHR | S_ISCHR(m) | Serial-interface devices such as keyboards, mice, printers, and tty terminals |
| directory | d | S_IFDIR | S_ISDIR(m) | Folder |
| link file | l | S_IFLNK | S_ISLNK(m) | Symbolic (soft) links; hard links are ordinary directory entries and have no type of their own |
| socket | s | S_IFSOCK | S_ISSOCK(m) | Used for network communication |
| regular file | - | S_IFREG | S_ISREG(m) | Divided into plain text files and binary files |
| named pipe | p | S_IFIFO | S_ISFIFO(m) | Named pipe (FIFO) file |

The third and fourth columns of the table are macros that Linux provides for checking the file type after a call to stat. To determine whether a file is a regular file, you can use the following code:

struct stat sb;   /* declared in <sys/stat.h> */
if (stat(pathname, &sb) == 0 && (sb.st_mode & S_IFMT) == S_IFREG) {
    /* Handle regular file */
}

Or use the test macro directly:

struct stat sb;   /* declared in <sys/stat.h> */
if (stat(pathname, &sb) == 0 && S_ISREG(sb.st_mode)) {
    /* Handle regular file */
}
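
For completeness, here is a minimal sketch that maps an st_mode value to the one-letter symbols in the table. The helper names file_type_char and path_type_char are made up for illustration. The second helper calls lstat rather than stat, because stat follows symbolic links and would therefore never report the l type.

```c
#define _XOPEN_SOURCE 700   /* exposes lstat and the S_IF* constants under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Map an st_mode value to the one-letter type symbol from the table.
 * Returns '?' for anything outside the seven listed types. */
static char file_type_char(mode_t mode) {
    switch (mode & S_IFMT) {
    case S_IFBLK:  return 'b';
    case S_IFCHR:  return 'c';
    case S_IFDIR:  return 'd';
    case S_IFLNK:  return 'l';
    case S_IFSOCK: return 's';
    case S_IFREG:  return '-';
    case S_IFIFO:  return 'p';
    default:       return '?';
    }
}

/* Hypothetical helper: look up the type of pathname without following
 * a final symbolic link. Returns '?' if lstat() fails. */
static char path_type_char(const char *pathname) {
    struct stat sb;
    assert(pathname != NULL);
    if (lstat(pathname, &sb) != 0) {
        return '?';
    }
    return file_type_char(sb.st_mode);
}
```

Calling path_type_char on a directory such as / yields d, and a path that does not exist yields ?.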

But our requirement is to tell text files from binary files. Both are regular files (S_IFREG), so the method above cannot distinguish them.

The versatile file command

file is a command that ships with Linux for detecting file types.
Roughly speaking, it reads the first 1024 bytes of a file, parses the file header according to the magic rules ( /etc/magic or /usr/share/misc/magic ), and prints the result to the screen.
It is also very simple to use: just type file followed by the file name:

[root@ck08 ~]# file anaconda-ks.cfg
anaconda-ks.cfg: ASCII text
[root@ck08 ~]# file tls.pcap
tls.pcap: tcpdump capture file (little-endian) - version 2.4 (Ethernet, capture length 262144)
[root@ck08 ~]# file zlib-1.2.11.tar.gz
zlib-1.2.11.tar.gz: gzip compressed data, was "zlib-1.2.11.tar", from Unix, last modified: Mon Jan 16 01:36:58 2017, max compression
[root@ck08 ~]# file /usr/bin/grep
/usr/bin/grep: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=bb5d89868c5a04ae48f76250559cb01fae1cd762, stripped

As the examples show, the file command is very powerful: it can identify a file's type in detail, including specifics such as encoding, compression format, and size.
So this command meets our needs.
However, we need a C implementation, so we have to study the rules of the magic file.

magic file rules

Each line in the file specifies one test rule for verifying the file type. A rule consists of four fields: offset , type , test , and message .

  • offset

    • The byte offset, counted from the beginning of the file, at which the check starts.
  • type

    • The type of the data to check, that is, the type of the data located at offset . See magic(5) for the full list. Commonly used types are:

      • byte : a one-byte value
      • short : a two-byte value
      • long : a four-byte value
      • string : a string
  • test

    • The test value. The data of the given type at offset is compared against it. Numbers and characters are written in C notation.
  • message

    • The message to print when the test succeeds.

    If type is a numeric type, it can be followed by &value , which means the data is ANDed with value before it is compared with the test value. If type is a string type, it can be followed by /[Bbc]* , where /b means ignore whitespace and /c means ignore letter case.
    If test is a numeric value, it can be prefixed with = , < , > , & , ^ , or ~ , meaning equal to, less than, greater than, AND, XOR, and negation respectively.
    If test is a string, it can be prefixed with = , < , or > .
    For example, the magic representation of an ELF file is:

    # ELF
    #0      string      ELF         ELF
    0       string      \177ELF     ELF
    >4      byte        1           32-bit
    >4      byte        2           64-bit
    >5      byte        1           LSB
    >5      byte        2           MSB
    >16     short       0           unknown type
    >16     short       1           relocatable
    >16     short       2           executable
    >16     short       3           dynamic lib
    >16     short       4           core file
    >18     short       0           unknown machine
    >18     short       1           WE32100
    >18     short       2           SPARC
    >18     short       3           80386
    >18     short       4           M68000
    >18     short       5           M88000
    >20     long        1           Version 1
    >36     long        1           MathCoPro/FPU/MAU Required
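
To make the rule format concrete, here is a minimal sketch of how the first few ELF rules could be evaluated by hand in C. The function name describe_elf and its signature are assumptions for illustration and are not part of libmagic; only the offsets and test values come from the rules. A line starting with > is a continuation rule, so it is only checked once the top-level rule at offset 0 has matched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Apply the first few ELF rules to a buffer holding the start of a file.
 * On a match, writes a description such as "ELF 64-bit LSB" into out and
 * returns 1. Returns 0 if the top-level rule (offset 0) does not match. */
static int describe_elf(const unsigned char *buf, size_t len,
                        char *out, size_t outsz) {
    const char *cls = "";
    const char *order = "";

    assert(buf != NULL && out != NULL);

    /* 0   string  \177ELF  ELF */
    if (len < 4 || memcmp(buf, "\177ELF", 4) != 0) {
        return 0;
    }
    /* >4  byte  1  32-bit   and   >4  byte  2  64-bit */
    if (len > 4) {
        if (buf[4] == 1) cls = " 32-bit";
        else if (buf[4] == 2) cls = " 64-bit";
    }
    /* >5  byte  1  LSB      and   >5  byte  2  MSB */
    if (len > 5) {
        if (buf[5] == 1) order = " LSB";
        else if (buf[5] == 2) order = " MSB";
    }
    snprintf(out, outsz, "ELF%s%s", cls, order);
    return 1;
}
```

Fed the bytes 7F 45 4C 46 02 01, the sketch produces "ELF 64-bit LSB", which corresponds to the beginning of the file output shown earlier for /usr/bin/grep.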

The magic format is really not something humans can read easily. Linux provides the libmagic library for parsing magic files, but when I tried it on CentOS 7 and Ubuntu 20.04 I could not get my program to build (gcc reported that magic.h could not be found, most likely because the development package that provides the header was not installed). Besides, I need a more general method that works not only on Linux but also on Windows and AIX, so implementing something that follows the same principles as file is not practical.
So, is there no more general way to determine the file type?
Many sources on the Internet say you can judge by the bytes in the file: if the file contains \x00 , it must be a binary or compressed file; otherwise it is an ordinary text file.
Most of the time this rule holds. But if the text file is encoded as UTF-16 or UTF-32 , it breaks down, because those encodings store ASCII characters with \x00 padding bytes.
Therefore, this scheme is not reliable.
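
The failure is easy to reproduce. The sketch below implements the \x00 rule in a function with the made-up name has_nul_byte and applies it to the two-character text "hi" in ASCII and in UTF-16LE. The sample byte arrays are written out by hand for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The text "hi" in ASCII and in UTF-16LE (no byte order mark). */
static const unsigned char ascii_hi[]   = {'h', 'i'};
static const unsigned char utf16le_hi[] = {'h', 0x00, 'i', 0x00};

/* The naive rule: return 1 if any of the first len bytes is \x00. */
static int has_nul_byte(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    return len > 0 && memchr(buf, 0, len) != NULL;
}
```

has_nul_byte returns 0 for ascii_hi but 1 for utf16le_hi, so perfectly ordinary UTF-16 text would be classified as binary.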

Special file headers

The idea behind libmagic , put bluntly, is to judge by the bytes at the start of the file. In other words, as long as we know the signatures of some special file headers, a file that matches one of them is a special file; otherwise it is treated as an ordinary text file. Following this idea, we can get a similar result to the libmagic library for the cases we care about.
The article on the standard encoding of various types of file headers lists some common signatures. jar packages and zip archives all start with 504B0304 (rar archives use a different signature, 52617221 ). Binary files on Linux, including .o and .so files and core dumps, are ELF files whose header is 7F454C46 (static .a libraries are ar archives that start with the text !<arch> ). Executable files on Windows start with 4D5A , typically 4D5A9000 . AIX is more complicated, but the first three bytes are basically 01DF00 . Based on these signatures, many files can be told apart.
In fact, on Windows files can largely be told apart by their suffix, and on Unix the suffix conventions also identify many files: you would never treat a file with the suffix .rpm as a text file, a .o file is a binary object file, and a .so file is a dynamic link library. The ambiguous cases are mostly executables such as ls , grep , and a.out , whose names carry no suffix that indicates the actual file type.
So the approach is clear, and it has two steps. First, use the file suffix to roughly pick out the obviously binary and compressed files. Then use the file header to distinguish the rest.
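
The header check in the second step lends itself to a table-driven form. The following is only a sketch: the struct signature type, the signatures table, and match_signature are names invented here, and the table holds just the four signatures discussed in this section. Unlike a per-platform #if, it checks every signature on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One known file header: the bytes expected at offset 0 and a label. */
struct signature {
    const char *bytes;   /* magic bytes at the start of the file */
    size_t      len;     /* number of bytes to compare */
    const char *label;   /* human-readable name (illustrative only) */
};

static const struct signature signatures[] = {
    {"\x7F\x45\x4C\x46", 4, "ELF"},
    {"\x50\x4B\x03\x04", 4, "zip/jar"},
    {"\x4D\x5A\x90\x00", 4, "Windows executable"},
    {"\x01\xDF\x00",     3, "AIX executable"},
};

/* Return the label of the first signature that matches the start of buf,
 * or NULL if none matches (the caller then treats the file as text). */
static const char *match_signature(const unsigned char *buf, size_t len) {
    size_t i;
    assert(buf != NULL || len == 0);
    for (i = 0; i < sizeof(signatures) / sizeof(signatures[0]); i++) {
        if (len >= signatures[i].len &&
            memcmp(buf, signatures[i].bytes, signatures[i].len) == 0) {
            return signatures[i].label;
        }
    }
    return NULL;
}
```

Adding a new format then only means adding one row to the table, which fits the idea of eventually moving the list into a configuration file.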

C language code implementation

The code is implemented as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int boolean;
#define FALSE 0
#define TRUE  1

/* Some common suffixes of binary and archive files; extend as needed, e.g. by loading them from a configuration file */
static const char *with_suffix[] = {".gz", ".rar", ".exe", ".bz2",
                                ".tar", ".xz", ".Z", ".rpm", ".zip",
                                ".a",   ".so", ".o", ".jar", ".dll",
                                ".lib", ".deb", ".I", ".png",".jpg",
                                ".mp3", ".mp4", ".m4a", ".flv", ".mkv",
                                ".rmvb", ".avi",  ".pcap", ".pdf", ".docx",
                                ".xlsx", ".pptx", ".ram", ".mid", ".dwg",
                                NULL};

/* Check whether a string ends with the given suffix */
static boolean string_has_suffix(const char *str, const char *suffix) {
    size_t n, m;
    if (str == NULL || suffix == NULL)
    {
        return FALSE;
    }
    n = strlen(str);
    m = strlen(suffix);
    if (n < m) {
        return FALSE;
    }
    /* Compare the last m characters of str with suffix */
    return strcmp(str + (n - m), suffix) == 0 ? TRUE : FALSE;
}

/* Check whether the file name has one of the special suffixes */
static boolean file_has_spec_suffix(const char *fname) {
   const char **suffix = NULL;
   suffix = with_suffix;
   while (*suffix)
   {
      if (string_has_suffix(fname, *suffix))
      {
         return TRUE;
      }
      suffix++;
   }
   return FALSE;
}

/* Check whether the file starts with one of the special file headers */
static boolean file_has_spec_header(const char *fname) {
    FILE *fp = NULL;
    size_t len = 0;
    unsigned char buf[16] = {0};
    /* Open in binary mode so that Windows does not translate any bytes */
    if ((fp = fopen(fname, "rb")) == NULL ){
       return FALSE;
    }

    /* fread, unlike fgets, does not stop at a newline; close the file right away */
    len = fread(buf, 1, sizeof(buf), fp);
    fclose(fp);
    if (len < 4)
    {
       /* Too short to hold any of the headers below */
       return FALSE;
    }
#if defined(__linux__)
    //ELF header
    if (memcmp(buf, "\x7F\x45\x4C\x46", 4) == 0) {
        return TRUE;
    }
#elif defined(_AIX)
    //executable binary
    if (memcmp(buf, "\x01\xDF\x00", 3) == 0) {
        return TRUE;
    }
#elif defined(_WIN32) || defined(WIN32)
    // standard exe file; normally already caught by the suffix check
    if (memcmp(buf, "\x4D\x5A\x90\x00", 4) == 0)
    {
        return TRUE;
    }
#endif
    if (memcmp(buf, "\x50\x4B\x03\x04", 4) == 0) {
        // may be a zip-based archive, e.g. jar or zip
        return TRUE;
    }

    return FALSE;
}


/* Test program:
* takes a file name on the command line and reports the type of that file
*/
int main(int argc, const char **argv) {
   if (argc < 2) {
      printf("usage: need target file\n");
      exit(EXIT_FAILURE);
   }
   const char *fname = argv[1];

   if (file_has_spec_suffix(fname)) {
      printf("file %s have special suffix, maybe it's a binary or archive file\n", fname);
   } else if (file_has_spec_header(fname)) {
      printf("file %s have special header, maybe it's a binary or archive file\n", fname);
   } else {
      printf("file %s should be a text file\n", fname);
   }
   return 0;
}

The results are as follows. You can compare them with the output of the file command shown earlier:

[root@ck08 ctest]# gcc -o magic magic.c
[root@ck08 ctest]# ./magic ~/anaconda-ks.cfg
file /root/anaconda-ks.cfg should be a text file
[root@ck08 ctest]# ./magic ~/tls.pcap
file /root/tls.pcap have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic ~/zlib-1.2.11.tar.gz
file /root/zlib-1.2.11.tar.gz have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic /usr/bin/grep
file /usr/bin/grep have special header, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic kafka_2.12-2.8.0.jar
file kafka_2.12-2.8.0.jar have special suffix, maybe it's a binary or archive file
[root@ck08 ctest]# ./magic kafka_2.12-2.8.0.jar.1
file kafka_2.12-2.8.0.jar.1 have special header, maybe it's a binary or archive file


禹鼎侯

OLAP database development. Cross-platform data collection.