Sqoop incremental import: the duplicate data problem

Published on January 16

If rows are duplicated when importing data by auto-increment ID, the following approach can be used:

[Image: image.png]

Image source:
http://cn.voidcc.com/question...

The relevant section of the official user guide follows:
https://sqoop.apache.org/docs...

7.2.10. Incremental Imports

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Argument
    Description

--check-column (col)
    Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)

--incremental (mode)
    Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.

--last-value (value)
    Specifies the maximum value of the check column from the previous import.
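To make the three arguments concrete, here is a minimal sketch that assembles a `sqoop import` command line in Python. The helper name `build_incremental_import`, the JDBC URL, the table name, and the target directory are hypothetical placeholders, not taken from the original post. The sketch only builds and prints the command; it does not run Sqoop.

```python
import shlex


def build_incremental_import(connect, table, target_dir,
                             check_column, mode, last_value):
    """Assemble a `sqoop import` argument list for an incremental import."""
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--check-column", check_column,   # column examined for new rows
        "--incremental", mode,            # append or lastmodified
        "--last-value", str(last_value),  # max check-column value already imported
    ]


# All values below are illustrative assumptions.
cmd = build_incremental_import(
    connect="jdbc:mysql://db.example.com/shop",
    table="orders",
    target_dir="/data/orders",
    check_column="id",
    mode="append",
    last_value=1000,
)
print(shlex.join(cmd))
```

Keeping the command as a list of arguments rather than a single string avoids quoting mistakes if it is later passed to a process launcher.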


Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
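The selection rule for append mode can be modeled in a few lines of Python. This is an in-memory sketch of the rule as the guide describes it (check column strictly greater than the last value), not Sqoop's implementation; the function name `append_import` and the sample rows are hypothetical. It also shows one plausible way duplicates can arise: re-running an import with a stale last value selects rows that were already imported.

```python
def append_import(source_rows, last_value, check_column="id"):
    """Return rows whose check column exceeds last_value, and the new last value."""
    new_rows = [r for r in source_rows if r[check_column] > last_value]
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last


# Hypothetical source table with ids 1..5.
source = [{"id": i, "name": f"row{i}"} for i in range(1, 6)]

# First import: everything above 0 is new.
first_batch, last = append_import(source, last_value=0)

# A new row arrives in the source table.
source.append({"id": 6, "name": "row6"})

# Stale last value: rows 1..5 are selected again, which duplicates them.
stale_batch, _ = append_import(source, last_value=0)

# Advanced last value: only the genuinely new row is selected.
fresh_batch, last = append_import(source, last_value=last)

print(len(stale_batch), [r["id"] for r in fresh_batch])
```

In this model the second run is free of duplicates only because the last value returned by the first run is carried forward.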

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
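A similar in-memory sketch illustrates lastmodified mode. The column name `updated_at`, the sample timestamps, and both helper functions are hypothetical. The sketch follows the guide's wording ("more recent than") and is not Sqoop's code. Because an updated row already exists in the target, simply appending the selected rows leaves two copies of it, so the sketch also models merging on a key column. Sqoop's import tool has a --merge-key option for this purpose; check the user guide for your version before relying on it.

```python
from datetime import datetime


def lastmodified_import(source_rows, last_value, check_column="updated_at"):
    """Return rows whose timestamp is more recent than last_value."""
    return [r for r in source_rows if r[check_column] > last_value]


def merge_by_key(existing, changed, key="id"):
    """Keep one row per key, letting changed rows replace existing ones."""
    merged = {r[key]: r for r in existing}
    merged.update({r[key]: r for r in changed})
    return sorted(merged.values(), key=lambda r: r[key])


# Rows already in the target after a previous import (illustrative data).
target = [
    {"id": 1, "updated_at": datetime(2021, 1, 10, 9, 0)},
    {"id": 2, "updated_at": datetime(2021, 1, 12, 9, 0)},
]
last_value = datetime(2021, 1, 12, 9, 0)

# Source table now: row 2 was updated and row 3 is new.
source = [
    {"id": 1, "updated_at": datetime(2021, 1, 10, 9, 0)},
    {"id": 2, "updated_at": datetime(2021, 1, 15, 12, 30)},
    {"id": 3, "updated_at": datetime(2021, 1, 16, 8, 0)},
]

changed = lastmodified_import(source, last_value)
appended = target + changed            # id 2 appears twice
merged = merge_by_key(target, changed)  # one row per id

print([r["id"] for r in appended], [r["id"] for r in merged])
```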

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.
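As a sketch of the saved-job approach, the snippet below builds the two command lines involved: one that creates a saved job wrapping an incremental import, and one that executes it. The job name, JDBC URL, table, and target directory are hypothetical placeholders. The snippet only prints the commands and does not invoke Sqoop.

```python
import shlex

job_name = "orders_incremental"  # hypothetical job name

# `sqoop job --create <name> -- import ...` stores the import definition.
create_cmd = [
    "sqoop", "job", "--create", job_name,
    "--",
    "import",
    "--connect", "jdbc:mysql://db.example.com/shop",  # illustrative URL
    "--table", "orders",
    "--target-dir", "/data/orders",
    "--check-column", "id",
    "--incremental", "append",
    "--last-value", "0",  # starting point for the first run
]

# Each execution reuses the stored definition.
exec_cmd = ["sqoop", "job", "--exec", job_name]

print(shlex.join(create_cmd))
print(shlex.join(exec_cmd))
```

Per the guide text above, the saved job tracks the last value between runs, so the execute command can be scheduled repeatedly without editing --last-value by hand.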

References:
http://cn.voidcc.com/question...
https://sqoop.apache.org/docs...
