
During development we may run into this kind of requirement: parsing information out of local files such as Excel or CSV. The data might be attendance punch records, calendar entries, or recent bills. What they have in common is that the data is large and messy, and entering it by hand means a huge, error-prone workload that eats up a lot of time. Is there a way to parse these files automatically and extract the useful information?

When a file does not contain much data, there are plenty of front-end tools to choose from. SheetJS, for example, provides many convenient methods for extracting useful information from Excel and CSV files.
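For instance, a minimal sketch with SheetJS (the xlsx npm package) might look like the following; the file name here is hypothetical:

const XLSX = require('xlsx');

const workbook = XLSX.readFile('attendance.xlsx');          // loads the whole file into memory
const firstSheet = workbook.Sheets[workbook.SheetNames[0]]; // take the first sheet
const rows = XLSX.utils.sheet_to_json(firstSheet);          // array of row objects keyed by header
console.log(rows.length, rows[0]);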

With only a few thousand rows there are many options, but once the data volume grows the processing becomes complicated. If an XLSX/CSV file reaches a million rows or more, even just opening it in Office or WPS for a look takes a long time.

So how do we parse data out of Excel/CSV/TXT files of that size?

Background

Below we will walk through the whole process using a hypothetical requirement. Suppose we need to parse data from local Excel, CSV, or TXT (or other format) files, clean it, and store it in a local database file. These files may be 5 MB, 50 MB, 500 MB or even larger. So how do we upload them in the browser environment? And how should they be parsed in the Node environment?

First of all, we need to understand how browser web pages upload large files.

How to upload large files from a web page?

Web pages can generally upload large files, but there is a problem: if the data to upload is large, the whole upload takes a long time, and given the uncertainties of the network, once it fails the entire upload has to start over, which wastes even more time.

To deal with this, we can split the large file into multiple small slices and upload only one slice at a time. That way, even if a request fails, we do not need to start from scratch; we only re-upload the failed slice.

If we want to use this method, we need to meet the following requirements:

  • Large files can be uploaded in slices
  • Uploads can be resumed from a breakpoint
  • Upload progress can be tracked

Let's first look at how to cut a large file. Web pages generally get local files through <input type='file' />. The file obtained from the input's event.target.files is an instance of the File class, which is a subclass of Blob.

A Blob object represents an immutable, raw, file-like data object. Its data can be read as text or binary, or converted into a ReadableStream for further processing. Simply put, you can think of a Blob as a container for binary data, i.e. it holds a large binary file. The Blob object has one very important method: slice(). Note that Blob objects are immutable; slice() returns a new Blob representing the piece of the binary file that was cut out.

The slice() method accepts three parameters: a start offset, an end offset, and an optional MIME type. If the MIME type is not set, the new Blob has the same MIME type as its parent. The File interface is based on Blob, so File objects also have a slice method, whose result contains the specified range of data from the source Blob.

Now that we have seen how cutting works, we can split a binary file. A splitting example looks like this:

function sliceInPiece(file, piece = 1024 * 1024 * 5) {
  let totalSize = file.size; // total size of the file
  let start = 0; // start byte of the current slice
  let end = start + piece; // end byte of the current slice
  let chunks = []
  while (start < totalSize) {
    // Cut out the data to upload this time according to the slice length.
    // The File object inherits from Blob, so it has the slice method.
    let blob = file.slice(start, end); 
    chunks.push(blob)

    start = end;
    end = start + piece;
  }
  return chunks
}

After obtaining the array of file slices, we can call the upload interface for each of them in turn.


let file =  document.querySelector("[name=file]").files[0];

const LENGTH = 1024 * 1024 * 0.1;
let chunks = sliceInPiece(file, LENGTH); // split the file into slices first

chunks.forEach(chunk=>{
  let fd = new FormData();
  fd.append("file", chunk);
  post('/upload', fd)
})

After the upload finishes, the server splices the uploaded slices back into one complete file; on the page side, a FileReader object can be used to read the data out of a Blob.

Of course, two problems arise here. First, facing a pile of uploaded slice files, how does the server know their correct order? Second, if multiple large files are uploaded at the same time, how does the server tell which slice belongs to which file?

The ordering problem can be handled by adding a parameter when building each slice's FormData, for example a chunkIndex parameter indicating the position of the current slice.

For the second problem, a parameter such as sourceFile (whose value can be the full path of the large file or, more strictly, the file's hash) can mark which original file a slice comes from. That way, when the server receives the data, it knows which slices belong to which file and in what order.
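Putting the two parameters together, a hedged sketch of the upload loop might look like this. The /upload endpoint and the post wrapper are reused from the earlier example (assumed to return a Promise), and calcFileHash is a hypothetical helper (it could be built on something like SparkMD5):

async function uploadInChunks(file) {
  const chunks = sliceInPiece(file);
  const sourceFile = await calcFileHash(file); // hypothetical helper that hashes the file

  for (let i = 0; i < chunks.length; i++) {
    const fd = new FormData();
    fd.append('file', chunks[i]);
    fd.append('chunkIndex', i);            // order of the current slice
    fd.append('chunkTotal', chunks.length);
    fd.append('sourceFile', sourceFile);   // which large file this slice belongs to

    await post('/upload', fd);             // assumed request wrapper returning a Promise
    console.log(`upload progress: ${Math.round(((i + 1) / chunks.length) * 100)}%`);
  }
}

Uploading sequentially like this also makes the third requirement, progress tracking, almost free: the number of completed slices over the total is the progress.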

If building this yourself is inconvenient for the moment, you can also consider using a cloud service. UPYUN Cloud Storage, for example, supports large-file upload and breakpoint resume:

Breakpoint resume

When uploading a large file, or uploading from a mobile device, the upload may fail because of network quality, long transfer times, and so on; in that case you can use resumable (breakpoint) upload. Note that images uploaded via resumable upload do not support preprocessing, and files uploaded via resumable upload cannot be overwritten by other upload methods; if overwriting is required, the file must be deleted first.


Terms and concepts

  • File chunking: the binary file is split directly into small chunks. The chunk size is fixed at 1 MB, except for the last chunk.
  • Upload stage: the x-upyun-multi-stage parameter indicates the stage of the resumable upload. There are three stages: initiate (upload initialization), upload (uploading), and complete (upload finished), carried out in order.
  • Part number: the x-upyun-part-id parameter indicates the current part number, which starts from 0.
  • Sequential upload: for the same resumable upload task, only sequential upload is supported.
  • Upload ID: the x-upyun-multi-uuid parameter uniquely identifies an upload task; it is a string 36 characters long.
  • Upload cleanup: unfinished resumable-upload files are kept for 24 hours, after which they are deleted.

As you can see, cloud storage solves the two problems mentioned above with the part number x-upyun-part-id and the upload ID x-upyun-multi-uuid. Note that these two values are not generated by the front end; they are returned in the response headers after the initial (initiate) request.

(Image: initializing a resumable upload on UPYUN)
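For illustration only, a heavily simplified sketch of the three-stage flow might look like the following. The endpoint, HTTP method, and authentication are placeholders rather than the real UPYUN API shape; only the three headers listed above come from the table, so consult the UPYUN documentation before using this in practice:

// Placeholder endpoint and auth; only the x-upyun-* headers above are taken from the docs.
const ENDPOINT = 'https://example-upyun-endpoint/bucket/path/to/file';

async function resumableUpload(file) {
  // 1. initiate: the upload ID comes back in the response headers
  const initRes = await fetch(ENDPOINT, {
    method: 'PUT',
    headers: { 'x-upyun-multi-stage': 'initiate' },
  });
  const uuid = initRes.headers.get('x-upyun-multi-uuid');

  // 2. upload: send 1 MB parts sequentially, numbered from 0
  const chunks = sliceInPiece(file, 1024 * 1024);
  for (let i = 0; i < chunks.length; i++) {
    await fetch(ENDPOINT, {
      method: 'PUT',
      headers: {
        'x-upyun-multi-stage': 'upload',
        'x-upyun-multi-uuid': uuid,
        'x-upyun-part-id': String(i),
      },
      body: chunks[i],
    });
  }

  // 3. complete
  await fetch(ENDPOINT, {
    method: 'PUT',
    headers: { 'x-upyun-multi-stage': 'complete', 'x-upyun-multi-uuid': uuid },
  });
}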

The section above covered how to upload large files from a web page. Next, let's look at how Node.js parses and processes such large files.

NodeJS parsing large files

First we need to clarify one concept: there is no File object in Node.js, but there is the fs (file system) module. The fs module supports interacting with the file system in a way modeled on standard POSIX functions.

POSIX stands for Portable Operating System Interface of UNIX. Simply put, POSIX provides a unified calling interface across operating systems with different kernels, such as opening a file on Linux versus on Windows. The method provided by the kernel may differ, but because fs follows the POSIX standard, the programmer simply calls fsPromises.open(path, flags[, mode]) in Node regardless of what the kernel provides.
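For instance, a minimal fs/promises call looks the same from the programmer's point of view on Linux, macOS, or Windows; the path here is hypothetical:

const { open } = require('fs/promises');

async function readFirstBytes(path) {
  const fh = await open(path, 'r');                            // same call on any supported OS
  try {
    const buf = Buffer.alloc(1024);
    const { bytesRead } = await fh.read(buf, 0, buf.length, 0); // read the first 1024 bytes
    return buf.subarray(0, bytesRead);
  } finally {
    await fh.close();                                          // always release the file descriptor
  }
}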

Here is a simple analogy with Vue. The way Vue creates page elements differs between environments such as the Web and Weex. For example, createElement on the Web looks like this:


export function createElement (tagName: string, vnode: VNode): Element {
  const elm = document.createElement(tagName)
  if (tagName !== 'select') {
    return elm
  }
  // false or null will remove the attribute but undefined will not
  if (vnode.data && vnode.data.attrs && vnode.data.attrs.multiple !== undefined) {
    elm.setAttribute('multiple', 'multiple')
  }
  return elm
}

Under Weex, the situation is as follows:

export function createElement (tagName: string): WeexElement {
  return document.createElement(tagName)
}

createElement is different in the two cases. Similarly, there are many other ways of creating modules or elements, but for the different platforms Vue exposes the same patch method to update or create components.

import * as nodeOps from 'web/runtime/node-ops'
import { createPatchFunction } from 'core/vdom/patch'
import baseModules from 'core/vdom/modules/index'
import platformModules from 'web/runtime/modules/index'

// the directive module should be applied last, after all
// built-in modules have been applied.
const modules = platformModules.concat(baseModules)

// nodeOps wraps a series of DOM operation methods; modules defines the hook implementations of several modules
export const patch: Function = createPatchFunction({ nodeOps, modules })

And the Weex version:

import * as nodeOps from 'weex/runtime/node-ops'
import { createPatchFunction } from 'core/vdom/patch'
import baseModules from 'core/vdom/modules/index'
import platformModules from 'weex/runtime/modules/index'

// the directive module should be applied last, after all
// built-in modules have been applied.
const modules = platformModules.concat(baseModules)

export const patch: Function = createPatchFunction({
  nodeOps,
  modules,
  LONG_LIST_THRESHOLD: 10
})

In this way, the same patch method can be called regardless of how the runtime environment is implemented internally. The POSIX philosophy is the same idea as this example.

With a brief understanding of POSIX, we return to the fs module. The fs module provides many methods for reading files, such as:

  • fs.read(fd, buffer, offset, length, position, callback) reads file data. To operate on a file it must be opened first; the fd argument is the file descriptor returned by fs.open.
  • fs.readFile(path[, options], callback) asynchronously reads the entire contents of a file. It can be seen as a further encapsulation of fs.read.

The usage scenarios are as follows:

import { readFile } from 'fs';

readFile('/etc/passwd','utf-8', (err, data) => {
  if (err) throw err;
  console.log(data);
});

Because fs.readFile buffers the entire file, it is fine when the file is small, but a large file puts pressure on memory. Is there a way to read a file with less memory pressure?

Yes, and that is where today's protagonist, the stream, comes in.

stream

A stream is an abstract interface for working with streaming data in Node.js. The stream module provides an API for implementing the stream interface. Streams can be readable, writable, or both.

The fs module has a fs.createReadStream(path[, options]) method, which returns a readable stream whose default highWaterMark is 64 KB, i.e. a 64 KB internal buffer. Once the internal read buffer reaches this threshold, the stream temporarily stops reading data from the underlying resource until the buffered data has been consumed.

Data can be consumed either by calling the pipe() method or directly through events.

// consume via pipe
readable.pipe(writable)

// or
// consume via events
readable.on('data', (chunk) => {
  writable.write(chunk);
});
readable.on('end', () => {
  writable.end();
});

Besides readable streams there are also writable streams, created with fs.createWriteStream(path[, options]), which write data to a file.
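For example, copying a large file by piping a readable stream into a writable stream keeps only a small buffer in memory at any time; the paths below are hypothetical:

const fs = require('fs');

const rs = fs.createReadStream('path/to/large-source.csv'); // reads in ~64 KB chunks by default
const ws = fs.createWriteStream('path/to/copy.csv');

rs.pipe(ws);                                  // pipe handles backpressure automatically
ws.on('finish', () => console.log('copy finished'));
rs.on('error', (err) => console.error(err));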

Well, the prerequisite knowledge is basically covered, so back to the topic. Suppose we have a folder containing dozens of XLSX/CSV files, each over 500 MB. How do we read the information out of these files and write it into database files?

Batch parsing CSV files

Assume we already know the paths of the files to parse and can access the files through those paths. We save these paths into an array named needParsePathArr, and we need to parse these CSV and XLSX files one by one in order, clean the data, and then write it into the database.

First comes the logic of reading the files one by one (readOneByOne).

async readOneByOne () {
   try {
    for (let i = 0; i < needParsePathArr.length; i++) {
      const filePath = needParsePathArr[i]
      console.log(`Parsing file #${i}: ${filePath}`)
      await streamInsertDB(filePath)
    }
  } catch (err) {
    // error handling omitted here (e.g. log and skip the failed file)
  }
}

streamInsertDB is the entry point for our main logic.

async function streamInsertDB (filePath) {
  return new Promise((resolve, reject) => {
    const ext = path.extname(filePath)
    // check the file type
    if (ext === '.csv') {
      // parse csv
      parseAndInsertFromCSV(filePath, resolve, reject)
    } else if (ext === '.xlsx') {
      // immediately invoked async function
      (async function getName () {
        try {
          // Convert to csv first. You can also skip the conversion and parse xlsx directly; this is explained later.
          const csvFileName = await convertXlsx2Csv(filePath)
          // reuse the csv parsing logic
          parseAndInsertFromCSV(csvFileName, resolve, reject)
        } catch (error) {
          reject(`error: ${error.message || error}`)
        }
      })()
    }
  })
}

parseAndInsertFromCSV is where the knowledge points mentioned earlier are put to use. Here is a brief introduction to each piece involved:

  • chardet: detects the encoding of the CSV file. Not every CSV is UTF-8 encoded; a CSV containing Chinese may be GBK, GB18030, etc. If such a file is read directly without handling the encoding, the Chinese will show up as garbled characters, so iconv is needed to convert it.
  • pipe: used to build a pipeline chain; you can think of pipe as a pipe that reads from one stream and writes into the target stream. Here it is used to decode and re-encode.
  • insertInBlock: called after a certain amount of data has been collected (in this case roughly 30,000 rows parsed from the CSV); it pauses for a moment to perform operations such as writing to the database or filtering and processing the data, depending on the actual needs.
  • csv (fast-csv): reads the concrete row data out of the stream.

For specific logical explanations, see the comments.

const chardet = require('chardet');
const csv = require('fast-csv'); // a fast csv parser
const iconv = require('iconv-lite');

const arrayFromParseCSV = []  // holds the csv rows parsed so far
let count = 0 // row counter
// resolve and reject are passed in from the outer function so that the caller can
// track the state of execution and run the follow-up logic correctly
function parseAndInsertFromCSV (filePath, resolve, reject) {
  const rs = fs.createReadStream(filePath)  // create a readable stream
  // debounce plus partial application here
  const delayInsert = debounce((isEnd, cb = () => {}) => insertInBlock(isEnd, cb, rs, resolve, reject), 300)
  // sampleSize: 5120 means only the first 5120 bytes are read to detect the encoding; there is no need to read the whole file
  chardet.detectFile(filePath, { sampleSize: 5120 }).then(encoding => {
    // if the encoding is not UTF-8, convert it to UTF-8
    if (encoding !== 'UTF-8') {
      rs.pipe(iconv.decodeStream(encoding))
        .pipe(iconv.encodeStream('UTF-8'))
        .pipe(csv.parse({ headers: false, ignoreEmpty: true, trim: true })) // parse csv
        .on('error', error => {
          reject(`csv parse error: ${error}`)
        })
        .on('data', rows => {
          count++ // count rows, because we read and process in blocks
          arrayFromParseCSV.push(rows) // push each row into the array as it is read
          if (count > 30000) { // once 30,000 rows have been read, process them first to avoid holding too much in memory
            rs.pause() // pause the readable stream
            delayInsert(false) // false: not finished yet. Note: even after rs.pause(), reading does not stop immediately, hence the debounce.
          }          
        }).on('end', rowCount => {
          console.log(`Finished parsing ${filePath}, ${rowCount} rows in total`)
          delayInsert(true, () => {
            rs.destroy() // destroy the stream
            resolve('ok') // this file has been fully read
          })
        })
    }
  })
}
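The debounce helper used by delayInsert above is not shown in the original snippet; a minimal trailing-edge version of the kind assumed here might look like this:

// Minimal trailing-edge debounce, as assumed by delayInsert above.
function debounce(fn, wait = 300) {
  let timer = null;
  return function (...args) {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args);
    }, wait);
  };
}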

The logic for cleaning data and subsequent operations is in insertInBlock.

function insertInBlock (isEnd, cb, rs, resolve, reject) {
  const arr = doSomethingWithData() // there may be some data-cleaning work here
  // suppose the follow-up requirement is to write the data into a database
  const batchInsert = () => {
    batchInsertDatabasePromise().then(() => {
      if (cb && typeof cb === 'function') cb()
      !isEnd && rs.resume() // this block has been written, so the stream can resume reading
    })
  }
  
  const tableExists = schemaHasTable() // e.g. check whether a table exists in the database; write if it does, otherwise create it first
  if (tableExists) {
     batchInsert()
   } else {
     // create the table (or do other preparation), then write
     doSomething().then(() => batchInsert())
  }
}

This completes the process of parsing and writing. Although much business code has been simplified here, the actual implementation is broadly similar to this flow.

Batch parsing of XLSX files

Convert to CSV first?

In the previous code example we converted the XLSX file into a CSV file with a writable stream created by fs.createWriteStream, and then reused the CSV parsing logic. Note that when writing data into a CSV file, the BOM header \ufeff should be written at the very beginning. Alternatively, you can use the convert function of xlsx-extract to convert XLSX files to TSV.


const { XLSX } = require('xlsx-extract')
new XLSX().convert('path/to/file.xlsx', 'path/to/destfile.tsv')
    .on('error', function (err) {
        console.error(err);
    })
    .on('end', function () {
        console.log('written');
    })

Some people may wonder: isn't the target CSV? Why convert to TSV?

In fact, the only difference between TSV and CSV is the field delimiter: CSV uses comma-separated values, while TSV uses tab-separated values. The fast-csv tool we used earlier to parse CSV files quickly supports specifying the tab character \t as the value delimiter.

import { parse } from '@fast-csv/parse';
const stream = parse({ delimiter: '\t' })
    .on('error', error => console.error(error))
    .on('data', row => console.log(row))
    .on('end', (rowCount: number) => console.log(`Parsed ${rowCount} rows`));
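Incidentally, the convertXlsx2Csv helper referenced earlier in streamInsertDB was not shown. A rough sketch built on xlsx-extract's row events and a write stream, with the \ufeff BOM mentioned above and a naive comma join that ignores proper CSV quoting and escaping, might look like this:

const fs = require('fs');
const path = require('path');
const { XLSX } = require('xlsx-extract');

function convertXlsx2Csv(xlsxPath) {
  return new Promise((resolve, reject) => {
    const csvPath = xlsxPath.replace(path.extname(xlsxPath), '.csv');
    const ws = fs.createWriteStream(csvPath);
    ws.write('\ufeff'); // BOM so that Excel/WPS recognise the file as UTF-8

    new XLSX().extract(xlsxPath, { sheet_nr: 1 })
      .on('row', (row) => ws.write(row.join(',') + '\n')) // naive join, no quoting/escaping
      .on('error', reject)
      .on('end', () => ws.end(() => resolve(csvPath)));
  });
}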

Direct parsing?

Is it possible to parse the XLSX file directly, without converting it to CSV? That is feasible too.

const { XLSX } = require('xlsx-extract') // tool for parsing xlsx files as a stream
// parser: 'expat' requires installing node-expat separately and can speed up parsing.
new XLSX().extract(filePath, { sheet_nr: 1, parser: 'expat' })
    .on('row', function (row) {
        // fired every time a row of data is read
      }).on('error', function (err) {
        // error
     });

But this approach has a flaw: once parsing starts, the reading of data cannot be paused. xlsx-extract wraps sax and does not expose pause or resume methods.

What if we read the XLSX file directly with a readable stream?

const rs = fs.createReadStream('path/to/xlsx.xlsx')

As you can see, the data in the stream now exists as raw buffers. But since the xlsx format is actually a zip archive containing compressed XML text, the readable stream cannot be used directly; it has to be decompressed first.

Unzipping can be done using the npm package unzipper.


const unzip = require('unzipper')
const zip = unzip.Parse();
rs.pipe(zip)
  .on('entry', function (entry) {
    console.log('entry ---', entry);
    const fileName = entry.path;
    const { type } = entry; // 'Directory' or 'File'
    const size = entry.vars.uncompressedSize; // There is also compressedSize;
    if (fileName === "this IS the file I'm looking for") {
      entry.pipe(fs.createWriteStream('output/path'));
    } else {
      entry.autodrain();
    }
  })

Now we have unzipped the file.

As mentioned earlier, xlsx-extract encapsulates sax, and sax itself is used to parse XML text, so we can also use sax to process readable streams.

Looking at the sax source code, it roughly inspects each character to decide whether it is content, a line break, a tag start, a tag end, and so on, and then triggers the corresponding event.

const saxStream = require('sax').createStream(false);
saxStream.on('error', function (e) {
  console.error('error!', e);
});
saxStream.on('opentag', function (node) {
  console.log('node ---', node);
});
saxStream.on('text', (text) => console.log('text ---', typeof text, text));

Finally combine the two:

const unzip = require('unzipper');
const saxStream = require('sax').createStream(false);
const zip = unzip.Parse();

saxStream.on('error', function (e) {
  console.error('error!', e);
});
saxStream.on('opentag', function (node) {
  console.log('node ---', node);
});
saxStream.on('text', (text) => {
    console.log('text ---', typeof text, text)
});

rs.pipe(zip)
  .on('entry', function (entry) {
    console.log('entry ---', entry);
    entry.pipe(saxStream)
  })

After testing with a local XLSX file, the console prints a series of parsed nodes and text. This output corresponds to the XML content inside the XLSX document; the SST and SI entries printed by Node are XML tag names.

So in fact we have also obtained the data inside the XLSX, although it still needs to be cleaned, aggregated, and matched up. At the same time, since we are operating directly on a readable stream, we can naturally pause and resume it to implement block reading and other logic.
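For example, block reading on top of the sax pipeline above could follow the same pause-and-flush pattern used for CSV earlier; the batch size and the handleBatch helper below are hypothetical:

// Hypothetical block-reading sketch on top of the unzip + sax pipeline above.
// handleBatch() stands in for cleaning the rows and writing them to the database.
// As with the CSV example, rs.pause() is not instantaneous, so a few extra events may still arrive.
let batch = [];

saxStream.on('text', (text) => {
  batch.push(text);
  if (batch.length >= 30000) {
    rs.pause();                        // temporarily stop reading the xlsx file
    handleBatch(batch).then(() => {
      batch = [];
      rs.resume();                     // continue once this block has been processed
    });
  }
});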

Summary

For small XLSX and CSV files, SheetJS alone can basically cover the parsing of files in various formats, but once the files get large, chunking and streaming reads and writes become indispensable.

By breaking down the preceding examples and code, we can understand how to solve this class of problem and extend different solutions to similar needs. Once we have a grasp of block-based processing of large files, we will know where to start when we run into similar problems.

