Capturing Online Video and Extracting Its Speech as Text in Java

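I. Background

I have recently been working on a project built around large language models, and one of its modules has to extract the speech from an online video as text and return it to the user. As a pure back-end Java engineer, this was my first attempt at anything of the kind.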

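II. Research

The module breaks down into three tasks: 1. extract the video from a web page; 2. convert the video to audio; 3. convert the audio to text.

For the first task, I tried jsoup, webmagic, and other tools, but in the end only selenium produced the result I wanted (and not without its own pitfalls).

The second task took a lot of effort. The open-source FFmpeg was my first choice, but installing it from the command line kept failing. I therefore turned to alternatives and tried Xuggler, JAVE, JAVE2, and JavaCV, all of which failed as well. In the end I went back to FFmpeg and, after some persistence, got it installed: downloading it from the official site and unpacking it locally is all it takes.

For the third task, a senior teammate pointed me to a ready-made solution: https://www.funasr.com. Ready-made or not, putting it into practice still took some work.

With these three steps in place, the whole pipeline should in theory run end to end. In practice it was less smooth: converting long videos to audio timed out, speech-to-text timed out, and so on. After some persistence these problems were resolved too, and the module now does what it should. The concrete implementation follows.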

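III. Practice

1. Extracting the video from a web page

a. Download the chromedriver plugin

I recommend downloading it from the web page. Its version has to match your Chrome browser, otherwise it will not run. Download address: https://chromedriver.storage.googleapis.com/index.html

b. Add the selenium jar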

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.1.0</version>
</dependency>

c. Without further ado, here is the code:

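    import java.util.List;

    import org.openqa.selenium.JavascriptExecutor;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    /**
     * Fetches the main video link from the given page.
     *
     * @param targetUrl URL of the target page
     * @return the main video link, or null if none is found
     */
    public static String catchMainVideo(String targetUrl) {
        // Point Selenium at the chromedriver binary; adjust the path to your local install
        System.setProperty("webdriver.chrome.driver", "xxx/driver/chromedriver");
        // ChromeOptions is optional; headless mode keeps the browser window from opening
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        options.addArguments("--disable-gpu");

        // Start a Chrome instance
        WebDriver driver = new ChromeDriver(options);
        try {
            // Open the page
            driver.get(targetUrl);

            // Give the page a moment to load
            Thread.sleep(100);

            // Run the selector inside the page; the matched nodes come back as WebElements
            JavascriptExecutor js = (JavascriptExecutor) driver;
            @SuppressWarnings("unchecked")
            List<WebElement> elements = (List<WebElement>) js.executeScript(
                    "return document.querySelectorAll('.sgVideoWrapper video source')");

            // Return the src of the first <source> whose type is video/mp4
            for (WebElement element : elements) {
                if ("video/mp4".equals(element.getAttribute("type"))) {
                    return element.getAttribute("src");
                }
            }
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            // Always shut the browser down, otherwise every call leaks a Chrome process
            driver.quit();
        }
    }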
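The fixed 100 ms pause in the method above is easy to outgrow: a slow page may not have rendered its video tag yet. One possible alternative, sketched here with the JDK only, is to poll for the element until a deadline passes. `PollingWait` and its `until` method are hypothetical names introduced for illustration; they are not part of selenium or of the original code.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper: polls a lookup until it yields a value or the timeout elapses. */
final class PollingWait {

    private PollingWait() {
    }

    /**
     * Calls lookup repeatedly, sleeping for interval between attempts.
     * Returns the first non-null value, or Optional.empty() if the timeout
     * passes or the thread is interrupted.
     */
    static <T> Optional<T> until(Supplier<T> lookup, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = lookup.get();
            if (value != null) {
                return Optional.of(value);
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }
}
```

In `catchMainVideo`, the lookup would be the loop that searches for the `video/mp4` source, called for example with a 10-second timeout and a 200 ms interval. Selenium's own `WebDriverWait` covers the same need if you prefer to stay inside the library.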

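2. Video to audio

a. Download ffmpeg first. Again I recommend downloading it from the web page: the command-line install failed many times for me, and upgrading Xcode did not help either. The web download finally worked: https://ffmpeg.org/download.html

b. Without further ado, here is the code

At first, converting a large video to audio in one piece worked fine, but the later speech-to-text step timed out on it, so I ended up splitting the audio into segments during conversion.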

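    /**
     * Extracts the audio track of a video and splits it into segment files.
     *
     * @param inputVideoPath       path of the input video file
     * @param outputAudioPrefix    prefix of the output audio files
     * @param segmentSizeInSeconds segment length in seconds
     */
    public static void video2audio(String inputVideoPath, String outputAudioPrefix, int segmentSizeInSeconds) {
        try {
            // -vn drops the video stream; -c:a copy keeps the audio stream without re-encoding,
            // which assumes the source audio is already AAC
            ProcessBuilder pb = new ProcessBuilder("xxx/ffmpeg", "-i", inputVideoPath, "-vn", "-c:a", "copy", "-f", "segment", "-segment_time", String.valueOf(segmentSizeInSeconds), outputAudioPrefix + "%03d.aac");
            pb.inheritIO();
            Process process = pb.start();
            // A non-zero exit code means ffmpeg failed, so do not report success blindly
            int exitCode = process.waitFor();
            if (exitCode == 0) {
                log.info("Audio splitting completed.");
            } else {
                log.error("ffmpeg exited with code {}", exitCode);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.error("video2audio interrupted", e);
        } catch (Exception e) {
            log.error("video2audio error", e);
        }
    }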
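Since long videos were one source of timeouts, it may help to put an explicit upper bound on the ffmpeg run instead of waiting indefinitely. The sketch below builds the same argument list as `video2audio` and runs it with a timeout. `FfmpegSegmenter`, `buildCommand`, and `run` are hypothetical names, and the `ffmpegPath` argument stands for wherever ffmpeg was unpacked locally.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper around the ffmpeg call used in video2audio. */
final class FfmpegSegmenter {

    private FfmpegSegmenter() {
    }

    /** Builds the ffmpeg arguments: drop video, copy audio, split into fixed-length segments. */
    static List<String> buildCommand(String ffmpegPath, String inputVideoPath,
                                     String outputAudioPrefix, int segmentSeconds) {
        List<String> command = new ArrayList<>();
        command.add(ffmpegPath);
        command.add("-i");
        command.add(inputVideoPath);
        command.add("-vn");
        command.add("-c:a");
        command.add("copy");
        command.add("-f");
        command.add("segment");
        command.add("-segment_time");
        command.add(String.valueOf(segmentSeconds));
        command.add(outputAudioPrefix + "%03d.aac");
        return command;
    }

    /** Runs the command; returns true only if it exits with code 0 within the timeout. */
    static boolean run(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (!process.waitFor(timeout, unit)) {
            // Still running after the deadline: kill it so it does not linger
            process.destroyForcibly();
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A call such as `FfmpegSegmenter.run(FfmpegSegmenter.buildCommand("xxx/ffmpeg", input, prefix, 60), 10, TimeUnit.MINUTES)` would then report failure rather than hang. The 60-second segment length and 10-minute limit are placeholder values, not figures from the original project.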

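3. Speech to text

This part is based on funasr. I took its offline code, read through it, and simplified it into the code below. The wss address it uses has to be deployed by yourself; see the documentation for details: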

import com.google.common.collect.Maps;
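// JsonUtil is an internal helper; any JSON library offering equivalent toJsonString/parseMap calls can stand in
import com.jd.store.common.util.JsonUtil;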
import org.apache.commons.collections4.MapUtils;
import org.apache.commons.compress.utils.Lists;
import org.java_websocket.client.WebSocketClient;
import org.java_websocket.handshake.ServerHandshake;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.FileInputStream;
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class FunasrWsClient extends WebSocketClient {

    private static final Logger log = LoggerFactory.getLogger(FunasrWsClient.class);

    String fileName;

    // Written by the WebSocket callback thread and read by the caller, hence volatile
    private volatile String fileContent;

    public String getFileContent() {
        return fileContent;
    }

    public void setFileContent(String fileContent) {
        this.fileContent = fileContent;
    }

    public FunasrWsClient(URI serverURI, String fileName) {
        super(serverURI);
        this.fileName = fileName;
    }

    public void sendJson(String mode, String strChunkSize, int chunkInterval, String wavName, boolean isSpeaking, String suffix) {
        try {
            Map<String, Object> obj = Maps.newHashMap();
            obj.put("mode", mode);

            String[] chunkList = strChunkSize.split(",");
            List<Integer> array = Lists.newArrayList();
            for (String s : chunkList) {
                array.add(Integer.parseInt(s.trim()));
            }

            obj.put("chunk_size", array);
            obj.put("chunk_interval", chunkInterval);
            obj.put("wav_name", wavName);

//            if (FunasrWsClient.hotwords.trim().length() > 0) {
//                String regex = "\\d+";
//                JSONObject jsonitems = new JSONObject();
//                String[] items = FunasrWsClient.hotwords.trim().split(" ");
//                Pattern pattern = Pattern.compile(regex);
//                StringBuilder tmpWords = new StringBuilder();
//                for (String item : items) {
//                    Matcher matcher = pattern.matcher(item);
//                    if (matcher.matches()) {
//                        jsonitems.put(tmpWords.toString().trim(), item.trim());
//                        tmpWords = new StringBuilder();
//                        continue;
//                    }
//                    tmpWords.append(item).append(" ");
//                }
//                obj.put("hotwords", jsonitems.toString());
//            }

//            if (suffix.equals("wav")) {
//                suffix = "mp3";
//            }
            obj.put("wav_format", suffix);
            obj.put("is_speaking", isSpeaking);
            String json = JsonUtil.toJsonString(obj);
            log.info("sendJson: " + json);
            send(json);
        } catch (Exception e) {
            log.error("sendJson error", e);
        }
    }

    public void sendEof() {
        try {
            Map<String, Object> obj = Maps.newHashMap();

            obj.put("is_speaking", Boolean.FALSE);

            log.info("sendEof: " + JsonUtil.toJsonString(obj));
            send(JsonUtil.toJsonString(obj));
        } catch (Exception e) {
            log.error("sendEof", e);
        }
    }

    public void recWav() {
        // Take the file extension (the text after the last dot) as the audio format
        String suffix = fileName.substring(fileName.lastIndexOf('.') + 1);
        sendJson(mode, strChunkSize, chunkInterval, fileName, true, suffix);
        File file = new File(fileName);

        int chunkSize = sendChunkSize;
        byte[] bytes = new byte[chunkSize];

        int readSize;
        try (FileInputStream fis = new FileInputStream(file)) {
            if (fileName.endsWith(".wav")) {
                // Skip the 44-byte header of a canonical WAV file so only audio data is sent
                fis.read(bytes, 0, 44);
            }
            readSize = fis.read(bytes, 0, chunkSize);
            while (readSize > 0) {
                // send when it is chunk size
                if (readSize == chunkSize) {
                    send(bytes);
                } else {
                    // send when at last or not is chunk size
                    byte[] tmpBytes = new byte[readSize];
                    System.arraycopy(bytes, 0, tmpBytes, 0, readSize);
                    send(tmpBytes);
                }
                if (!mode.equals("offline")) {
                    Thread.sleep(chunkSize / 32);
                }

                readSize = fis.read(bytes, 0, chunkSize);
            }

//            if (!mode.equals("offline")) {
//                // if not offline, we send eof and wait for 3 seconds to close
//                Thread.sleep(2000);
//                sendEof();
//                Thread.sleep(3000);
//                close();
//            }
//
//            else {
            // if offline, just send eof
            sendEof();
//            }

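        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.error("recWav interrupted", e);
        } catch (Exception e) {
            log.error("recWav error", e);
        }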
    }

    @Override
    public void onOpen(ServerHandshake handshake) {
        this.recWav();
    }

    @Override
    public void onMessage(String message) {
        log.info("received: " + message);

        Map<String, Object> jsonObject = JsonUtil.parseMap(message);
        if (MapUtils.isEmpty(jsonObject)) {
            return;
        }
        Object text = jsonObject.get("text");
        log.info("text: " + text);

        // Hand the recognized text back to the caller; skip messages without a text field
        if (text != null) {
            fileContent = text.toString();
        }

        close();
    }

    @Override
    public void onClose(int code, String reason, boolean remote) {

    }

    @Override
    public void onError(Exception e) {
        log.error("onError ", e);
    }

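    // Settings sent to the server in the first JSON message
    static String mode = "online";
    static String strChunkSize = "5,10,5";
    static int chunkInterval = 10;
    // Bytes per binary frame; for 16 kHz, 16-bit mono PCM this would be 60 ms of audio
    static int sendChunkSize = 1920;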

    public static String execute(String fileName) {
        try {
            String wsAddress = "wss://xxx";

            FunasrWsClient c = new FunasrWsClient(new URI(wsAddress), fileName);

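            // connect() returns immediately; recWav() runs once the connection opens
            c.connect();

            // Wait a fixed 5 seconds for the server to return the text
            TimeUnit.SECONDS.sleep(5);
            return c.fileContent;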
        } catch (Exception e) {
            log.error("execute error", e);
        }
        return null;
    }
    
}
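`execute` waits a fixed five seconds and then reads whatever `fileContent` holds, so a slow response is lost and a fast one still costs the full wait. A possible refinement, sketched with the JDK only, is to let the callbacks complete a `CompletableFuture` and have the caller wait on it with a timeout. `TranscriptWaiter` and its methods are hypothetical names; wiring them into `onMessage` and `onError` is an assumption about how the class could be adapted, not something the original code does.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical holder that lets a caller wait for the text delivered by a callback thread. */
final class TranscriptWaiter {

    private final CompletableFuture<String> result = new CompletableFuture<>();

    /** Intended to be called from onMessage once the recognized text is known. */
    void onText(String text) {
        result.complete(text);
    }

    /** Intended to be called from onError so the waiting caller is released early. */
    void onFailure(Exception e) {
        result.completeExceptionally(e);
    }

    /** Blocks until the text arrives; returns null on timeout, failure, or interruption. */
    String await(long timeout, TimeUnit unit) {
        try {
            return result.get(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException | TimeoutException e) {
            return null;
        }
    }
}
```

With this in place, `execute` could return `waiter.await(30, TimeUnit.SECONDS)` instead of sleeping, returning as soon as the text arrives. The 30-second limit is a placeholder to be tuned against the segment length.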

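IV. Summary

After a series of experiments, I can now capture an online video on my local machine and extract its speech as text. Next steps include getting the same tools running on a server and adding failure retries to each stage, so that conversion quality can be guaranteed.

Looking back, neither the code nor the pipeline is large, yet the first pass through it was full of pitfalls. In short, better tools come from building steadily on the experience of those who went before.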
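The failure retry mentioned in the summary could start from something as small as the sketch below, which transcribes the audio segments in order, retries each one a limited number of times, and joins the results. `SegmentTranscriber` and its methods are hypothetical names, and treating a `null` result as a failure is an assumption that matches how `FunasrWsClient.execute` reports errors.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical glue code: transcribes segments in order, retrying each one on failure. */
final class SegmentTranscriber {

    private SegmentTranscriber() {
    }

    /** Tries up to maxAttempts times; a null result or a runtime exception counts as a failure. */
    static String withRetry(Function<String, String> transcribe, String file, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String text = transcribe.apply(file);
                if (text != null) {
                    return text;
                }
            } catch (RuntimeException e) {
                // Treat the exception like a null result and move on to the next attempt
            }
        }
        return null;
    }

    /** Joins the text of all segments; fails loudly if any segment cannot be transcribed. */
    static String transcribeAll(List<String> segmentFiles,
                                Function<String, String> transcribe, int maxAttempts) {
        StringBuilder fullText = new StringBuilder();
        for (String file : segmentFiles) {
            String text = withRetry(transcribe, file, maxAttempts);
            if (text == null) {
                throw new IllegalStateException("Transcription failed for " + file);
            }
            fullText.append(text);
        }
        return fullText.toString();
    }
}
```

Passing `FunasrWsClient::execute` as the `transcribe` argument would connect this to the client above, with the segment list being the `.aac` files produced by `video2audio` in their numbered order.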

Author: Wang Jiangbo, JD Retail

Source: JD Cloud Developer Community
