Writing a Pygments syntax extension

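The Pygments repository lives on Bitbucket, though it is mostly GitHub that uses it for highlighting.
I'm not fluent in Python, but I learned the basics once, and since this is only scripting it shouldn't be a problem.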

hg basics

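The main repository is here: https://bitbucket.org/birkenfeld/pygments-main
After forking I found out it is managed with hg (Mercurial). I don't know it well, but being used to Git I can get by.
A few basic commands work much like their Git counterparts: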

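hg clone <url> # clone the repository
hg add # start tracking new files; there is no staging area
hg commit -m "add Cirru" # commits right away
hg log --limit 1 # show the latest log entry
hg diff # haven't figured out how to configure colors yet... really ugly
hg push # push straight to the remote repository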

Before committing you need to set your username. I configured ~/.hgrc following a StackOverflow answer, and that worked:
http://stackoverflow.com/questions/2329023/mercurial-error-abort-no-us...

Cirru syntax

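The syntax I want to add is indentation-based, so handling each line on its own is enough. The coloring rules come down to a few points:

The token at the start of a line, the first token inside (), and the first token after $ are highlighted as Function
() and $ are highlighted as Operator
Every "" string is highlighted as a string
... I just noticed I forgot to highlight \ escapes inside strings...
Ordinary text is highlighted as Variable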

The test file looks like this:

-- https://github.com/Cirru/cirru-gopher/blob/master/code/scope.cr

set a (int 2)

print (self)

set c (child)

under c
  under parent
    print a

print $ get c a

set c x (int 3)
print $ get c x

set just-print $ code
  print a

print just-print

eval (self) just-print
eval just-print

print (string "string content\nand")

demo ((((()))))

"eval" $ string "eval"

Development workflow

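Extension development has its own documentation page: http://pygments.org/docs/lexerdevelopment/
The main steps go like this:

Fork the repository, clone it locally, and look in the pygments/lexers/ directory for a file such as web.py.
The files there are grouped into a few categories by platform, for example jvm.py and functional.py.
web.py holds things like JS, JSON, and CoffeeScript, so Cirru goes there.
In web.py, first register the name in the __all__ list. The class is of course named CirruLexer.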

__all__ = ['BrainfuckLexer', 'BefungeLexer', ...]

After adding the name, run this command to generate the map file:

$ make mapfiles

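Next, write the CirruLexer class along with its configuration.
flags sets the regex flags; most of the other attributes define the language's name and aliases.
tokens is obviously the important part. More on that below.
Debugging means generating an HTML file and reading Python's error messages. The command is below; read the docs and experiment with it: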

$ ./pygmentize -O full -f html -o /tmp/example.html tests/examplefiles/example.diff
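
The same debugging loop can be driven from Python instead of the command line. This is only a sketch: the inline diff sample stands in for tests/examplefiles/example.diff, and the helper name render_full_html is my own, not something from Pygments.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import DiffLexer

# Assumption: a tiny inline diff replaces tests/examplefiles/example.diff.
SAMPLE = "--- a/file\n+++ b/file\n@@ -1 +1 @@\n-old\n+new\n"


def render_full_html(code, lexer):
    """Return a standalone HTML page, similar to `pygmentize -O full -f html`."""
    return highlight(code, lexer, HtmlFormatter(full=True))


html = render_full_html(SAMPLE, DiffLexer())
print(html[:80])  # peek at the start of the generated page
```

Swapping DiffLexer() for the lexer under development gives a quick way to regenerate the page after every change.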

Once it works, run the commands below to test it. If they pass, try pushing to the repository:

$ make mapfiles
$ pip install nose
$ make test

Lexer rules

https://bitbucket.org/krebo/pygments-main/src/a1fed5d0a0c94b377bcce8ef...

Reading the code beats reading the docs. The JSON lexer is fairly simple, so I copied it here. Start reading from the root entry at the end:

class JsonLexer(RegexLexer):
    """
    For JSON data structures.

    *New in Pygments 1.5.*
    """

    name = 'JSON'
    aliases = ['json']
    filenames = ['*.json']
    mimetypes = [ 'application/json', ]

    # integer part of a number
    int_part = r'-?(0|[1-9]\d*)'

    # fractional part of a number
    frac_part = r'\.\d+'

    # exponential part of a number
    exp_part = r'[eE](\+|-)?\d+'


    flags = re.DOTALL
    tokens = {
        'whitespace': [
            (r'\s+', Text),
        ],

        # represents a simple terminal value
        'simplevalue': [
            (r'(true|false|null)\b', Keyword.Constant),
            (('%(int_part)s(%(frac_part)s%(exp_part)s|'
              '%(exp_part)s|%(frac_part)s)') % vars(),
             Number.Float),
            (int_part, Number.Integer),
            (r'"(\\\\|\\"|[^"])*"', String.Double),
        ],


        # the right hand side of an object, after the attribute name
        'objectattribute': [
            include('value'),
            (r':', Punctuation),
            # comma terminates the attribute but expects more
            (r',', Punctuation, '#pop'),
            # a closing bracket terminates the entire object, so pop twice
            (r'}', Punctuation, ('#pop', '#pop')),
        ],

        # a json object - { attr, attr, ... }
        'objectvalue': [
            include('whitespace'),
            (r'"(\\\\|\\"|[^"])*"', Name.Tag, 'objectattribute'),
            (r'}', Punctuation, '#pop'),
        ],

        # json array - [ value, value, ... ]
        'arrayvalue': [
            include('whitespace'),
            include('value'),
            (r',', Punctuation),
            (r']', Punctuation, '#pop'),
        ],

        # a json value - either a simple value or a complex value (object or array)
        'value': [
            include('whitespace'),
            include('simplevalue'),
            (r'{', Punctuation, 'objectvalue'),
            (r'\[', Punctuation, 'arrayvalue'),
        ],


        # the root of a json document should be a value
        'root': [
            include('value'),
        ],

    }

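As I understand it, each key corresponds to a "state", and states are used in two ways:

When a rule tuple has three elements, the last one declares which state to enter next
include('value') pulls in all the rules of the value state

Note that states are kept on a stack, so there are #pop and #push operations.
Naming a state as the 3rd element already pushes it, so #push is only needed to push the current state again.
#pop, on the other hand, gets used all the time...
The 3rd element can also be a tuple listing several states, and #pop:2 pops twice.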
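
To make the stack behaviour concrete, here is a minimal sketch of a lexer that only knows words and nested parentheses. ParenLexer and its state name paren are made up for this illustration and are not part of Pygments.

```python
from pygments.lexer import RegexLexer
from pygments.token import Error, Name, Punctuation, Text


class ParenLexer(RegexLexer):
    """Hypothetical toy lexer: words and nested parentheses only."""
    name = 'ParenDemo'
    tokens = {
        'root': [
            (r'\(', Punctuation, 'paren'),  # 3rd element: push the 'paren' state
            (r'[^\s()]+', Name),
            (r'\s+', Text),
        ],
        'paren': [
            (r'\(', Punctuation, '#push'),  # push 'paren' again for deeper nesting
            (r'\)', Punctuation, '#pop'),   # leave one nesting level
            (r'[^\s()]+', Name),
            (r'\s+', Text),
        ],
    }


tokens = list(ParenLexer().get_tokens('a (b (c))\n'))
for ttype, value in tokens:
    print(ttype, repr(value))
```

Each ( pushes one level and each ) pops one, so after the last ) the lexer is back in root. A stray ) in root has no matching rule there and would come out as an Error token.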

The 2nd element of the tuple is the token type. The full list is here: http://pygments.org/docs/tokens/
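
Token types form a hierarchy, which a few lines of Python can show. This is a small sketch; the helper css_class is my own name, while STANDARD_TYPES is the mapping Pygments keeps from standard token types to the short CSS classes used in HTML output.

```python
from pygments.token import STANDARD_TYPES, Name, Operator, String


def css_class(ttype):
    """Short CSS class the HTML formatter uses for a standard token type."""
    return STANDARD_TYPES[ttype]


# `in` tests whether one token type is a subtype of another.
is_string = String.Double in String
full_name = str(Name.Function)
classes = [css_class(t) for t in (Name.Function, Name.Variable, Operator, String)]

print(is_string, full_name, classes)
```

These short class names are what to look for when inspecting the generated HTML.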

The first element is a regular expression in Python's flavor, which is largely similar to Perl's. Two references:
http://docs.python.org/2/library/re.html#re.match
http://wiki.ubuntu.org.cn/Python正则表达式操作指南

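The overall approach is hard to summarize, and the documentation itself is quite clear:
http://pygments.org/docs/lexerdevelopment/
Roughly: lexing starts in the root state.
The rules of the current state are tried in order, and the first regex that matches at the current position consumes the matched text.
If the rule names a state, that state is pushed onto the stack; if it says #pop, the lexer goes back to the previous state.
Matching then continues with the rules of whichever state is on top of the stack,
until the input is exhausted.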
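
The loop described above can be sketched in plain Python without Pygments. This is a simplified model of the idea, not the library's actual implementation: token names are plain strings, and the rule table RULES and the function tokenize are invented for the illustration.

```python
import re

# Rules mimic the Pygments shape: (regex, token name, optional new state).
RULES = {
    'root': [
        (r'"', 'String', 'string'),
        (r'[^"\s]+', 'Name', None),
        (r'\s+', 'Text', None),
    ],
    'string': [
        (r'[^"]+', 'String', None),
        (r'"', 'String', '#pop'),
    ],
}


def tokenize(text, rules=RULES):
    """Yield (token, value) pairs, tracking the current state on a stack."""
    stack = ['root']
    pos = 0
    while pos < len(text):
        for pattern, token, new_state in rules[stack[-1]]:
            match = re.compile(pattern).match(text, pos)
            if match is None:
                continue  # try the next rule of the current state
            yield token, match.group()
            pos = match.end()  # consume the matched text
            if new_state == '#pop':
                stack.pop()
            elif new_state == '#push':
                stack.append(stack[-1])
            elif new_state is not None:
                stack.append(new_state)
            break
        else:
            # No rule matched: emit one character as an error and move on.
            yield 'Error', text[pos]
            pos += 1


result = list(tokenize('say "hi there"'))
print(result)
```

Because rules are tried in order, putting a general pattern before a specific one hides the specific one, which is worth keeping in mind when ordering entries in tokens.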

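As for errors: in the generated HTML, text the lexer could not recognize is marked as an error with a box around it.
The other source is Python exceptions. For example, index out of range means #pop popped too many states off the stack.
A malformed regex also raises an error. Beyond that it feels like a black box >_<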
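
Rather than hunting for boxes in the HTML, unrecognized text can also be listed directly, since it comes out of the lexer as Error tokens. A minimal sketch, with a made-up WordsOnlyLexer and helper find_errors:

```python
from pygments.lexer import RegexLexer
from pygments.token import Error, Name, Text


class WordsOnlyLexer(RegexLexer):
    """Hypothetical lexer that only understands lowercase words and whitespace."""
    name = 'WordsOnly'
    tokens = {
        'root': [
            (r'[a-z]+', Name),
            (r'\s+', Text),
        ],
    }


def find_errors(lexer, code):
    """Collect the text of every Error token the lexer produces."""
    return [value for ttype, value in lexer.get_tokens(code) if ttype is Error]


errors = find_errors(WordsOnlyLexer(), 'abc @ def\n')
print(errors)
```

An empty list means every character of the sample was matched by some rule.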

Full code

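import re
from pygments.lexer import RegexLexer
from pygments.token import Text, String, Number, Name, Operator

class CirruLexer(RegexLexer):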
    """
    Syntax rules of Cirru can be found at:
    http://grammar.cirru.org/

    * using `()` to markup blocks, but limited in the same line
    * using `""` to markup strings, allow `\` to escape
    * using `$` as a shorthand for `()` till indentation end or `)`
    * using indentations for create nesting
    """

    name = 'Cirru'
    aliases = ['cirru']
    filenames = ['*.cirru', '*.cr']
    mimetypes = ['text/x-cirru']
    flags = re.MULTILINE

    tokens = {
        'string': [
            (r'[^"\\\n]', String),
            (r'\\"', String),
            (r'\\', String),
            (r'"', String, '#pop'),
        ],
        'function': [
            (r'[\w-][^\s\(\)\"]*', Name.Function, '#pop'),
            (r'\)', Operator, '#pop'),
            (r'(?=\n)', Text.Whitespace, '#pop'),
            (r'\(', Operator, '#push'),
            (r'"', String, ('#pop', 'string')),
            (r'\s+', Text.Whitespace),
        ],
        'line': [
            (r'^\B', Text.Whitespace, 'function'),
            (r'\$', Operator, 'function'),
            (r'\(', Operator, 'function'),
            (r'\)', Operator),
            (r'(?=\n)', Text.Whitespace, '#pop'),
            (r'\n', Text.Whitespace, '#pop'),
            (r'"', String, 'string'),
            (r'\s+', Text.Whitespace),
            (r'[\d\.]+', Number),
            (r'[\w-][^\"\(\)\s]*', Name.Variable),
        ],
        'root': [
            (r'^\s*', Text.Whitespace, ('line', 'function')),
            (r'^\s+$', Text.Whitespace),
        ]
    }
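
To try the result, a lexer can be looked up by its alias and run over one line of the test file. This sketch assumes the installed Pygments ships a Cirru lexer under the alias cirru; its rules may differ from the version listed above.

```python
from pygments.lexers import get_lexer_by_name

# Assumption: the installed Pygments registers a lexer under the alias 'cirru'.
SOURCE = 'print $ get c a\n'

lexer = get_lexer_by_name('cirru')
pairs = list(lexer.get_tokens(SOURCE))

for ttype, value in pairs:
    print(ttype, repr(value))

# Joining the token values gives back the text that was lexed.
rebuilt = ''.join(value for _, value in pairs)
```

Printing the pairs is a quick way to check which token type each piece of a line receives.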

Result

I submitted a PR, and Pygments merged it

https://bitbucket.org/birkenfeld/pygments-main/pull-request/275/add-sy...
https://bitbucket.org/birkenfeld/pygments-main/commits/all

When I have time I'll try writing Cirru highlighting for LightTable

题叶