
Original link: He Xiaodong's blog

This is a common interview question, and some large companies like to ask it.

Question: suppose a 1 GB file stores unordered, non-unique numbers. How can the whole file be sorted using only 100 MB of memory?

Implementation process:

  1. First read the large file line by line in groups of 10000 lines, sort each group, and write it to its own file, with names like t1.txt, t2.txt, ..., until the entire file has been read and split.
  2. Then traverse all the files, read the first line of each file into the temporary sorting array $tmpNums, take the smallest value, put it into the temporary storage array $nums, and record the index $idx of its position.
  3. Read the next number from the file corresponding to the index where the minimum value was found, put it into the same index of the temporary sorting array, and repeat step 2 until the contents of all files have been read. At that point the whole file has been sorted.
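
Step 1 could be sketched as follows. This is only an illustration under assumptions: the function name splitIntoSortedChunks, its parameters, and the input path in the usage note are hypothetical, and the input is assumed to hold one integer per line.

```php
<?php
// Hypothetical helper (not from the original post): splits $inputPath into
// sorted chunk files t1.txt, t2.txt, ... inside $outDir and returns their paths.
function splitIntoSortedChunks(string $inputPath, string $outDir, int $linesPerChunk = 10000): array
{
    $in = fopen($inputPath, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $inputPath");
    }

    $chunkFiles = [];
    $buffer = [];

    // Sort the buffered numbers and write them out as the next chunk file.
    $flush = function () use (&$buffer, &$chunkFiles, $outDir) {
        if ($buffer === []) {
            return;
        }
        sort($buffer, SORT_NUMERIC);
        $path = rtrim($outDir, '/') . '/t' . (count($chunkFiles) + 1) . '.txt';
        file_put_contents($path, implode(PHP_EOL, $buffer) . PHP_EOL);
        $chunkFiles[] = $path;
        $buffer = [];
    };

    // Only one group of lines is held in memory at a time.
    while (($line = fgets($in)) !== false) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        $buffer[] = (int)$line;
        if (count($buffer) === $linesPerChunk) {
            $flush();
        }
    }
    $flush(); // the last, possibly shorter, group
    fclose($in);

    return $chunkFiles;
}
```

For example, splitIntoSortedChunks('./runtime/big.txt', './runtime/txt') would fill ./runtime/txt/ with sorted chunk files, where big.txt is a made-up name for the large input file. The group size controls peak memory, so it should be chosen so that one group of integers fits comfortably in the available budget.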

The following is PHP demo code for the multi-way merge sort. It is only a simple skeleton: for example, the part where min() takes the minimum could be replaced by a fixed-size min-heap or a priority queue that stores each value together with the file it came from.

PHP multi-way merge demo code:

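function multiWaySort()
{
    // Open a file handle for every sorted chunk file

    $fds = [];
    $dir = scandir('./runtime/txt/');
    foreach ($dir as $file) {
        if ($file != '.' && $file != '..') {
            $fds[] = fopen('./runtime/txt/' . $file, 'rb');
        }
    }

    // Read the first line of each file into the temporary sorting array,
    // keyed by the same index as its file handle

    $tmpNums = [];
    foreach ($fds as $idx => $fd) {
        $line = fgets($fd);
        if ($line !== false) {
            $tmpNums[$idx] = (int)$line;
        } else {
            // Empty file: nothing to merge from it
            fclose($fd);
            unset($fds[$idx]);
        }
    }

    // Open the result file once instead of once per written value

    $out = fopen('./runtime/result.txt', 'ab');

    $nums = [];
    while (!empty($tmpNums)) {
        // Put the minimum value into the temporary storage array

        $value = min($tmpNums);
        $nums[] = $value;

        // Read the next line from the file the minimum value came from

        $idx = array_search($value, $tmpNums);
        $next = fgets($fds[$idx]);

        if ($next === false) {
            // This file is exhausted, so drop it from the merge
            fclose($fds[$idx]);
            unset($tmpNums[$idx], $fds[$idx]);
        } else {
            $tmpNums[$idx] = (int)$next;
        }

        // Append to the result file once the storage array reaches a fixed size

        if (count($nums) == 20) {
            fwrite($out, implode(PHP_EOL, $nums) . PHP_EOL);
            $nums = [];
        }
    }

    // Write whatever is left in the storage array

    if (!empty($nums)) {
        fwrite($out, implode(PHP_EOL, $nums) . PHP_EOL);
    }
    fclose($out);
}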
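
The min() and array_search() calls in the demo scan every open file on each step. A heap-based variant, as suggested earlier, might look like the sketch below. It uses PHP's built-in SplMinHeap and stores [value, file index] pairs, which PHP compares element by element, so the smallest value is always on top. The function name heapMerge, its parameters, and the batch size are assumptions for illustration, and each chunk file is assumed to exist and to hold one integer per line with no blank lines.

```php
<?php
// Hypothetical helper (not from the original post): merges already-sorted
// chunk files into $outputPath using a min-heap of [value, file index] pairs.
function heapMerge(array $chunkPaths, string $outputPath, int $batchSize = 1000): void
{
    $fds = [];
    $heap = new SplMinHeap();

    // Seed the heap with the first number of each chunk file.
    foreach ($chunkPaths as $idx => $path) {
        $fds[$idx] = fopen($path, 'rb');
        if (($line = fgets($fds[$idx])) !== false) {
            $heap->insert([(int)$line, $idx]);
        }
    }

    $out = fopen($outputPath, 'wb');
    $batch = [];
    while (!$heap->isEmpty()) {
        // The top of the heap is the smallest value among all chunk heads.
        [$value, $idx] = $heap->extract();
        $batch[] = $value;

        // Refill the heap from the file the minimum came from.
        if (($line = fgets($fds[$idx])) !== false) {
            $heap->insert([(int)$line, $idx]);
        }

        // Write in batches to limit the number of write calls.
        if (count($batch) >= $batchSize) {
            fwrite($out, implode(PHP_EOL, $batch) . PHP_EOL);
            $batch = [];
        }
    }

    // Write the final partial batch.
    if ($batch !== []) {
        fwrite($out, implode(PHP_EOL, $batch) . PHP_EOL);
    }

    fclose($out);
    array_map('fclose', $fds);
}
```

With k chunk files, each step costs roughly O(log k) heap work instead of an O(k) scan, which matters mainly when the number of chunk files is large.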

