SegmentFault questions tagged 数据仓库 (data warehouse) — feed updated 2022-04-20T15:05:22+08:00
Feed: https://segmentfault.com/feeds/tag/数据仓库
License: https://creativecommons.org/licenses/by-nc-nd/4.0/

Is Hive really necessary for a data warehouse with a small data volume?
https://segmentfault.com/q/1010000041731593 · 2022-04-20T15:05:22+08:00 · malie0 (https://segmentfault.com/u/malie0)
<p>The data volume is under a million rows, but the client wants a data warehouse. At this scale Hive doesn't seem to deliver any of the benefits big-data tooling is meant to provide. Is there a better alternative?</p>

Anyone on SegmentFault doing data warehousing? Looking for the community
https://segmentfault.com/q/1010000040760321 · 2021-09-30T10:30:53+08:00 · 灵感 (https://segmentfault.com/u/linggan)
<p>As the title says: is anyone on SegmentFault working on data warehouses? I'm looking for the community.</p>

Kettle data sync: the log file contains errors, but the data syncs successfully — what is causing this?
https://segmentfault.com/q/1010000012981120 · 2018-01-25T08:40:20+08:00 · wongcw (https://segmentfault.com/u/wongcw)
<p>Environment<br>1. Kettle 7.1<br>2. JDK 1.8.0_151<br>3. Windows Server 2012 Datacenter edition</p>
<p>Observed behavior</p>
<p>1. The data syncs successfully<br>2. The log file contains ERROR entries<br>3. The job finishes normally and appears unaffected</p>
<p>Excerpt from the error log:</p>
<pre><code>00:15:19,364 ERROR [BlueprintContainerImpl] Unable to start blueprint container for bundle pentaho-big-data-api-jdbc due to unresolved dependencies [(objectClass=org.pentaho.osgi.metastore.locator.api.MetastoreLocator)]
java.util.concurrent.TimeoutException
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:336)
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
00:15:20,220 ERROR [BlueprintContainerImpl] Unable to start blueprint container for bundle pentaho-big-data-kettle-plugins-hive due to unresolved dependencies [(objectClass=org.pentaho.big.data.api.jdbc.DriverLocator)]
java.util.concurrent.TimeoutException
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:336)
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
00:15:20,368 ERROR [BlueprintContainerImpl] Unable to start blueprint container for bundle pentaho-big-data-kettle-plugins-kafka due to unresolved dependencies [(objectClass=org.pentaho.osgi.metastore.locator.api.MetastoreLocator)]
java.util.concurrent.TimeoutException
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:336)
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
00:15:25,573 ERROR [BlueprintContainerImpl] Unable to start blueprint container for bundle pentaho-big-data-impl-shim-hive due to unresolved dependencies [(objectClass=org.pentaho.big.data.api.jdbc.JdbcUrlParser)]
java.util.concurrent.TimeoutException
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:336)
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
00:15:25,580 ERROR [BlueprintContainerImpl] Unable to start blueprint container for bundle pentaho-big-data-impl-vfs-hdfs due to unresolved dependencies [(objectClass=org.pentaho.di.core.osgi.api.MetastoreLocatorOsgi)]
java.util.concurrent.TimeoutException
at org.apache.aries.blueprint.container.BlueprintContainerImpl$1.run(BlueprintContainerImpl.java:336)
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)</code></pre>
How can I make Hive read an ORC file's fields according to that file's own metadata?
https://segmentfault.com/q/1010000011810059 · 2017-11-01T14:23:40+08:00 · rebiekong (https://segmentfault.com/u/rebiekong)
<p>Problem description:<br>The directory of a Hive external table contains the following two ORC files.<br>File A, at /path/db/test/t1/A.snappy.orc, has column order x, y, z.<br>File B, at /path/db/test/t1/B.snappy.orc, has column order y, z, x.<br>The external table was created to match file A's structure.<br>When a row comes from file B, the values returned for each column do not match the columns as stored in that file.</p>
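A likely explanation for the symptom above is that Hive's ORC reader historically matched file columns to table columns by position, not by name, so file B's first physical column (y) is returned as the table's first column (x), and so on. The sketch below is plain Python with made-up helper names (read_positional, read_by_name) — no Hive involved — purely to illustrate the difference between the two mapping strategies:

```python
# Table schema as declared when the external table was created (from file A).
table_cols = ["x", "y", "z"]

# File B physically stores its columns in a different order; this dict is
# one row of file B, keyed by the file's own column metadata.
file_b_cols = ["y", "z", "x"]
file_b_row = {"y": 2, "z": 3, "x": 1}

def read_positional(table_cols, file_cols, row):
    """Map the i-th file column to the i-th table column, ignoring names."""
    return {tc: row[fc] for tc, fc in zip(table_cols, file_cols)}

def read_by_name(table_cols, file_cols, row):
    """Map file columns to table columns by name, using the file's metadata."""
    return {tc: row[tc] for tc in table_cols if tc in file_cols}

# Positional mapping reproduces the reported bug: x receives file B's y value.
print(read_positional(table_cols, file_b_cols, file_b_row))  # {'x': 2, 'y': 3, 'z': 1}

# By-name mapping gives the result the asker expects.
print(read_by_name(table_cols, file_b_cols, file_b_row))     # {'x': 1, 'y': 2, 'z': 3}
```

Newer ORC readers do support schema evolution by name, but whether Hive applies it depends on the Hive/ORC version and configuration, so the most portable fix is to rewrite file B so its column order matches the table definition.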