This article takes a look at the YoutubeDocumentReader in Spring AI Alibaba.

YoutubeDocumentReader

community/document-readers/spring-ai-alibaba-starter-document-reader-youtube/src/main/java/com/alibaba/cloud/ai/reader/youtube/YoutubeDocumentReader.java

public class YoutubeDocumentReader implements DocumentReader {

  private static final String WATCH_URL = "https://www.youtube.com/watch?v=%s";

  private final ObjectMapper objectMapper;

  private static final List<String> YOUTUBE_URL_PATTERNS = List.of("youtube\\.com/watch\\?v=([^&]+)",
      "youtu\\.be/([^?&]+)");

  private final String resourcePath;

  private static final int MEMORY_SIZE = 5;

  private static final int BYTE_SIZE = 1024;

  private static final int MAX_MEMORY_SIZE = MEMORY_SIZE * BYTE_SIZE * BYTE_SIZE;

  private static final WebClient WEB_CLIENT = WebClient.builder()
    .defaultHeader("Accept-Language", "en-US")
    .codecs(configurer -> configurer.defaultCodecs().maxInMemorySize(MAX_MEMORY_SIZE))
    .build();

  public YoutubeDocumentReader(String resourcePath) {
    Assert.hasText(resourcePath, "Query string must not be empty");
    this.resourcePath = resourcePath;
    this.objectMapper = new ObjectMapper();
  }

  @Override
  public List<Document> get() {
    List<Document> documents = new ArrayList<>();
    try {
      String videoId = extractVideoIdFromUrl(resourcePath);
      String subtitleContent = getSubtitleInfo(videoId);
      documents.add(new Document(StringEscapeUtils.unescapeHtml4(subtitleContent)));
    }
    catch (IOException e) {
      throw new RuntimeException("Failed to load document from Youtube: {}", e);
    }
    return documents;
  }

  // Method to extract the videoId from the resourcePath
  public String extractVideoIdFromUrl(String resourcePath) {
    for (String pattern : YOUTUBE_URL_PATTERNS) {
      Pattern regexPattern = Pattern.compile(pattern);
      Matcher matcher = regexPattern.matcher(resourcePath);
      if (matcher.find()) {
        return matcher.group(1); // Extract the videoId (captured group)
      }
    }
    throw new IllegalArgumentException("Invalid YouTube URL: Unable to extract videoId.");
  }

  public String getSubtitleInfo(String videoId) throws IOException {
    // Step 1: Fetch the HTML content of the YouTube video page
    String url = String.format(WATCH_URL, videoId);
    String htmlContent = fetchHtmlContent(url).block(); // Blocking for simplicity in
                              // this example

    // Step 2: Extract the subtitle tracks from the HTML
    String captionsJsonString = extractCaptionsJson(htmlContent);
    if (captionsJsonString != null) {
      JsonNode captionsJson = objectMapper.readTree(captionsJsonString);
      JsonNode captionTracks = captionsJson.path("playerCaptionsTracklistRenderer").path("captionTracks");

      // Check if captionTracks exists and is an array
      if (captionTracks.isArray()) {
        // Step 3: Extract and decode each subtitle track's URL
        StringBuilder subtitleInfo = new StringBuilder();
        JsonNode captionTrack = captionTracks.get(0);
        // Safely access languageCode and baseUrl with null checks
        String language = captionTrack.path("languageCode").asText("Unknown");
        String urlEncoded = captionTrack.path("baseUrl").asText("");

        // Decode the URL to avoid \u0026 issues
        String decodedUrl = URLDecoder.decode(urlEncoded, StandardCharsets.UTF_8);

        String subtitleText = fetchSubtitleText(decodedUrl);
        subtitleInfo.append("Language: ").append(language).append("\n").append(subtitleText).append("\n\n");

        return subtitleInfo.toString();
      }
      else {
        return "No captions available.";
      }
    }
    else {
      return "No captions data found.";
    }
  }

  private Mono<String> fetchHtmlContent(String url) {
    // Use WebClient to fetch HTML content asynchronously
    return WEB_CLIENT.get().uri(url).retrieve().bodyToMono(String.class);
  }

  private String extractCaptionsJson(String htmlContent) {
    // Extract the captions JSON from the HTML content
    String marker = "\"captions\":";
    int startIndex = htmlContent.indexOf(marker);
    if (startIndex != -1) {
      int endIndex = htmlContent.indexOf("\"videoDetails", startIndex);
      if (endIndex != -1) {
        String captionsJsonString = htmlContent.substring(startIndex + marker.length(), endIndex);
        return captionsJsonString.trim();
      }
    }
    return null;
  }

  private String fetchSubtitleText(String decodedUrl) throws IOException {
    // Fetch the subtitle text by making a request to the decoded subtitle URL
    org.jsoup.nodes.Document doc = Jsoup.connect(decodedUrl).get();

    // Assuming the subtitle text is inside <transcript> tags, extract the text
    StringBuilder subtitleText = new StringBuilder();
    doc.select("text").forEach(textNode -> {
      String text = textNode.text();
      subtitleText.append(text).append("\n");
    });

    return subtitleText.toString();
  }

}
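The YoutubeDocumentReader constructor requires a resourcePath (the video URL), and the class holds a shared static WebClient. Its get method first calls extractVideoIdFromUrl to obtain the videoId, then calls getSubtitleInfo to fetch the subtitles, and finally wraps the HTML-unescaped result in a List<Document>. getSubtitleInfo requests https://www.youtube.com/watch?v=videoId, slices the captions JSON out of the returned HTML (the text between "captions": and "videoDetails), parses it, and reads the languageCode and baseUrl of the first caption track only. It then downloads the subtitle text from that URL with Jsoup. If the captions data or the captionTracks array is missing, it returns a plain message string ("No captions data found." or "No captions available.") instead of throwing.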
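To make the videoId step concrete, here is a minimal standalone sketch that reuses the two regular expressions from YOUTUBE_URL_PATTERNS above. The class name VideoIdSketch, the method name extractVideoId, and the query parameters in the sample URLs are hypothetical and exist only for illustration; this is not the library's code.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class, not part of Spring AI Alibaba
class VideoIdSketch {

    // Same two expressions as YOUTUBE_URL_PATTERNS in the source above,
    // compiled once instead of on every call
    private static final List<Pattern> PATTERNS = List.of(
            Pattern.compile("youtube\\.com/watch\\?v=([^&]+)"),
            Pattern.compile("youtu\\.be/([^?&]+)"));

    // Returns the first captured group of the first pattern that matches
    static String extractVideoId(String url) {
        for (Pattern pattern : PATTERNS) {
            Matcher matcher = pattern.matcher(url);
            if (matcher.find()) {
                return matcher.group(1);
            }
        }
        throw new IllegalArgumentException("Invalid YouTube URL: " + url);
    }

    public static void main(String[] args) {
        // Both calls print q-9wxg9tQRk
        System.out.println(extractVideoId("https://www.youtube.com/watch?v=q-9wxg9tQRk&t=10s"));
        System.out.println(extractVideoId("https://youtu.be/q-9wxg9tQRk?si=example"));
    }
}
```

One consequence of the first pattern is that v must be the first query parameter: a URL such as https://www.youtube.com/watch?feature=share&v=q-9wxg9tQRk matches neither expression and ends in the IllegalArgumentException.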

Example

community/document-readers/spring-ai-alibaba-starter-document-reader-youtube/src/test/java/com/alibaba/cloud/ai/reader/youtube/YoutubeDocumentReaderTest.java

public class YoutubeDocumentReaderTest {

  private static final Logger logger = LoggerFactory.getLogger(YoutubeDocumentReaderTest.class);

  @Test
  void youtubeDocumentReaderTest() {
    YoutubeDocumentReader youtubeDocumentReader = new YoutubeDocumentReader(
        "https://www.youtube.com/watch?v=q-9wxg9tQRk");
    List<Document> documents = youtubeDocumentReader.get();
    logger.info("documents: {}", documents);
  }

}
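
The test above calls the live YouTube site, so it depends on network access. As a complement, the marker-based slicing that extractCaptionsJson performs can be tried offline. The following is a minimal sketch, not the library's test: the class CaptionsSliceSketch and the HTML fragment are hypothetical and hand-written for illustration, and a real watch page is far larger.

```java
// Hypothetical sketch class, not part of Spring AI Alibaba
class CaptionsSliceSketch {

    // Mirrors the logic of extractCaptionsJson in the source above:
    // take the text between "captions": and the next "videoDetails
    static String extractCaptionsJson(String html) {
        String marker = "\"captions\":";
        int start = html.indexOf(marker);
        if (start == -1) {
            return null; // no captions marker in the page
        }
        int end = html.indexOf("\"videoDetails", start);
        if (end == -1) {
            return null; // end marker missing
        }
        return html.substring(start + marker.length(), end).trim();
    }

    public static void main(String[] args) {
        // Hand-written fragment (an assumption), NOT real YouTube output
        String html = "<script>var r = {\"captions\":{\"playerCaptionsTracklistRenderer\":"
                + "{\"captionTracks\":[]}},\"videoDetails\":{}};</script>";
        System.out.println(extractCaptionsJson(html));
        // A page without the marker yields null
        System.out.println(extractCaptionsJson("<html></html>"));
    }
}
```

Note that the slice keeps the comma that precedes "videoDetails", so the string handed to objectMapper.readTree is a JSON object followed by one trailing character. The approach also assumes that "captions" appears before "videoDetails" in the page.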

Summary

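spring-ai-alibaba-starter-document-reader-youtube provides YoutubeDocumentReader, which uses WebClient to request the watch page for the given video URL, extracts the language and text of the first caption track, and returns the result as a List<Document>.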
