本文主要研究一下Spring AI Alibaba的ObsidianDocumentReader

ObsidianDocumentReader

community/document-readers/spring-ai-alibaba-starter-document-reader-obsidian/src/main/java/com/alibaba/cloud/ai/reader/obsidian/ObsidianDocumentReader.java

public class ObsidianDocumentReader implements DocumentReader {

    private final Path vaultPath;

    private final MarkdownDocumentParser parser;

    /**
     * Constructor for reading all files in vault
     * @param vaultPath Path to Obsidian vault
     */
    public ObsidianDocumentReader(Path vaultPath) {
        this.vaultPath = vaultPath;
        this.parser = new MarkdownDocumentParser();
    }

    @Override
    public List<Document> get() {
        List<Document> allDocuments = new ArrayList<>();

        // Find all markdown files in vault
        List<ObsidianResource> resources = ObsidianResource.findAllMarkdownFiles(vaultPath);

        // Parse each file
        for (ObsidianResource resource : resources) {
            try {
                List<Document> documents = parser.parse(resource.getInputStream());
                String source = resource.getSource();

                // Add metadata to each document
                for (Document doc : documents) {
                    doc.getMetadata().put(ObsidianResource.SOURCE, source);
                }

                allDocuments.addAll(documents);
            }
            catch (IOException e) {
                throw new RuntimeException("Failed to read Obsidian file: " + resource.getFilePath(), e);
            }
        }

        return allDocuments;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static class Builder {

        private Path vaultPath;

        public Builder vaultPath(Path vaultPath) {
            this.vaultPath = vaultPath;
            return this;
        }

        public ObsidianDocumentReader build() {
            return new ObsidianDocumentReader(vaultPath);
        }

    }

}
ObsidianDocumentReader的get方法通过ObsidianResource.findAllMarkdownFiles(vaultPath)来读取ObsidianResource,之后遍历resources使用MarkdownDocumentParser进行解析

ObsidianResource

community/document-readers/spring-ai-alibaba-starter-document-reader-obsidian/src/main/java/com/alibaba/cloud/ai/reader/obsidian/ObsidianResource.java

public class ObsidianResource implements Resource {

    public static final String SOURCE = "source";

    public static final String MARKDOWN_EXTENSION = ".md";

    private final Path vaultPath;

    private final Path filePath;

    private final InputStream inputStream;

    /**
     * Constructor for single file
     * @param vaultPath Path to Obsidian vault
     * @param filePath Path to markdown file
     */
    public ObsidianResource(Path vaultPath, Path filePath) {
        Assert.notNull(vaultPath, "VaultPath must not be null");
        Assert.notNull(filePath, "FilePath must not be null");
        Assert.isTrue(Files.exists(vaultPath), "Vault directory does not exist: " + vaultPath);
        Assert.isTrue(Files.exists(filePath), "File does not exist: " + filePath);
        Assert.isTrue(filePath.toString().endsWith(MARKDOWN_EXTENSION), "File must be a markdown file: " + filePath);

        this.vaultPath = vaultPath;
        this.filePath = filePath;
        try {
            this.inputStream = new FileInputStream(filePath.toFile());
        }
        catch (IOException e) {
            throw new RuntimeException("Failed to create input stream for file: " + filePath, e);
        }
    }

    /**
     * Find all markdown files in the vault Recursively searches through all
     * subdirectories Only includes .md files and ignores hidden files/directories
     * @param vaultPath Root path of the Obsidian vault
     * @return List of ObsidianResource for each markdown file
     */
    public static List<ObsidianResource> findAllMarkdownFiles(Path vaultPath) {
        Assert.notNull(vaultPath, "VaultPath must not be null");
        Assert.isTrue(Files.exists(vaultPath), "Vault directory does not exist: " + vaultPath);
        Assert.isTrue(Files.isDirectory(vaultPath), "VaultPath must be a directory: " + vaultPath);

        List<ObsidianResource> resources = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(vaultPath)) {
            paths
                // Only include .md files
                .filter(path -> path.toString().endsWith(MARKDOWN_EXTENSION))
                // Ignore hidden files and files in hidden directories
                .filter(path -> {
                    Path relativePath = vaultPath.relativize(path);
                    String[] pathParts = relativePath.toString().split("/");
                    for (String part : pathParts) {
                        if (part.startsWith(".")) {
                            return false;
                        }
                    }
                    return true;
                })
                // Only include regular files (not directories)
                .filter(Files::isRegularFile)
                .forEach(path -> resources.add(new ObsidianResource(vaultPath, path)));
        }
        catch (IOException e) {
            throw new RuntimeException("Failed to walk vault directory: " + vaultPath, e);
        }
        return resources;
    }

    //......
}    
ObsidianResource构造器要求输入vaultPath和filePath,其findAllMarkdownFiles方法会遍历vaultPath目录,找出.md结尾的文件

示例

community/document-readers/spring-ai-alibaba-starter-document-reader-obsidian/src/test/java/com/alibaba/cloud/ai/reader/obsidian/ObsidianDocumentReaderIT.java

@EnabledIfEnvironmentVariable(named = "OBSIDIAN_VAULT_PATH", matches = ".+")
class ObsidianDocumentReaderIT {

    private static final String VAULT_PATH = System.getenv("OBSIDIAN_VAULT_PATH");

    // Static initializer to log a message if environment variable is not set
    static {
        if (VAULT_PATH == null || VAULT_PATH.isEmpty()) {
            System.out.println("Skipping Obsidian tests because OBSIDIAN_VAULT_PATH environment variable is not set.");
        }
    }

    ObsidianDocumentReader reader;

    @BeforeEach
    void setUp() {
        // Only initialize if VAULT_PATH is set
        if (VAULT_PATH != null && !VAULT_PATH.isEmpty()) {
            reader = ObsidianDocumentReader.builder().vaultPath(Path.of(VAULT_PATH)).build();
        }
    }

    @Test
    void should_read_markdown_files() {
        // Skip test if reader is null
        Assumptions.assumeTrue(reader != null, "Skipping test because ObsidianDocumentReader could not be initialized");

        // when
        List<Document> documents = reader.get();

        // then
        assertThat(documents).isNotEmpty();

        // Verify document content and metadata
        for (Document doc : documents) {
            // Verify source metadata
            assertThat(doc.getMetadata()).containsKey(ObsidianResource.SOURCE);
            String source = doc.getMetadata().get(ObsidianResource.SOURCE).toString();
            assertThat(source).isNotEmpty().endsWith(ObsidianResource.MARKDOWN_EXTENSION);

            // Verify content
            assertThat(doc.getText()).isNotEmpty();

            // Print for debugging
            System.out.println("Document source: " + source);
            if (doc.getMetadata().containsKey("category")) {
                System.out.println("Document category: " + doc.getMetadata().get("category"));
            }
            System.out.println("Document content: " + doc.getText());
            System.out.println("---");
        }
    }

}

小结

spring-ai-alibaba-starter-document-reader-obsidian提供了ObsidianDocumentReader用于读取指定仓库(vaultPath)下的所有markdown文件,之后使用MarkdownDocumentParser去解析为List<Document>

doc


codecraft
11.9k 声望2k 粉丝

当一个代码的工匠回首往事时,不因虚度年华而悔恨,也不因碌碌无为而羞愧,这样,当他老的时候,可以很自豪告诉世人,我曾经将代码注入生命去打造互联网的浪潮之巅,那是个很疯狂的时代,我在一波波的浪潮上留下...