There's been some recent discussion about whether tree-sitter grammars can be used to parse Markdown with some hacks (currently it's done by working around all of the tree-sitter machinery, which causes a lot of problems), with no consensus among plugin authors.
I’ve been using tree-sitter via FFI from Common Lisp, but what I’d really like is a way to write my own code generator so that the generated parser could be “native” Lisp code. Otherwise, it’s an amazing tool: my only other complaint is the lack of a grammar for Objective-C, which would be useful for a Lisp/Objective-C bridge I’ve been working on.
I think that it'd be pretty easy to generate parser code in languages other than C, but it would be a lot of work to port the core library itself[1] to those other languages.
There's an architecture for compilers that I've been wanting for years, where a keystroke change to the source code results in an incremental change to the AST, and the compiler can then consume that AST delta to generate a binary patch to the compiled executable.
Could tree-sitter be used for that? (What I want is to feed tree-sitter a stream of keystroke changes and get out a stream of minimal AST changes as a result.)
You don't get the AST _diff_ as the result (you get a new tree whose structure is shared with the old tree), but tree-sitter is specifically designed to support this kind of incremental edit use case: https://tree-sitter.github.io/tree-sitter/using-parsers#edit...
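To make the structure-sharing point concrete, here is a minimal sketch in Python. It is not Tree-sitter's API: the `Node` class and the `changed_nodes` helper are hypothetical names invented for this illustration, and children are paired by position, which is a simplification. The idea is that if the new tree reuses unchanged subtrees by identity, a consumer can recover something like a delta by walking both trees and skipping any subtree that is the same object in both.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable syntax node (not a Tree-sitter type)."""
    kind: str
    text: str = ""
    children: Tuple["Node", ...] = ()


def changed_nodes(old: Optional[Node], new: Node) -> List[Node]:
    """Return the nodes of `new` that are not shared with `old`.

    Sharing is detected by object identity, so an untouched subtree
    is skipped without visiting anything inside it.
    """
    if old is new:
        return []
    changed = [new]
    old_children = old.children if old is not None else ()
    # Assumption: children are paired by position; a real differ would be smarter.
    for i, child in enumerate(new.children):
        counterpart = old_children[i] if i < len(old_children) else None
        changed.extend(changed_nodes(counterpart, child))
    return changed


a = Node("number", "1")
b = Node("number", "2")
old_tree = Node("sum", children=(a, b))

# Simulate an edit that turns "2" into "3": only the edited leaf and its
# ancestors are rebuilt, while the untouched leaf `a` is reused as-is.
new_tree = Node("sum", children=(a, Node("number", "3")))

delta = changed_nodes(old_tree, new_tree)
print([(n.kind, n.text) for n in delta])
```

In Tree-sitter itself, the C API function `ts_tree_get_changed_ranges` reports the ranges whose syntactic structure differs between an old tree and a new one, which is probably the closest built-in equivalent of this walk.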
I've done two grammars for my own use in the last few months (well, one isn't quite complete yet) and it's been quite an enjoyable (learning) experience. Thanks for sharing this tool!
When I played around with tree-sitter a bit, I noticed there were situations where AST elements didn't contain exactly what I'd expect them to. For example, comments are represented in the AST, but unfortunately their contents aren't parsed out following the language's conventions.
I was wondering if this is something I could open an issue about? Would that go in the main tree-sitter repo, or should I open one language by language?
I was looking into automating some stuff across all languages with tree-sitter, but handling all of the languages' comment syntaxes made it very hard.
Most tree-sitter grammars just parse comments as a single token. Can you give an example of what you mean when you say "contents of the comment parsed out"?
Are you talking about conventions like JSDoc, for putting structured data inside of comments? On GitHub, we handle that by parsing JSDoc comments in a separate pass, using a separate parser. We do it this way because JSDoc isn't really part of the JavaScript language, not all projects use JSDoc, and not all applications are interested in parsing the text inside of comments.
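The two-pass approach described above can be sketched roughly as follows. This is only an illustration, not the parser GitHub uses: the `TAG_RE` pattern and the `parse_jsdoc` helper are hypothetical and handle just simple one-line tags such as `@param` and `@returns`. The first pass is assumed to have already produced the comment as a single opaque token.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately small pattern: "@tag {type} rest of line"
TAG_RE = re.compile(r"@(?P<tag>\w+)\s*(?:\{(?P<type>[^}]*)\})?\s*(?P<rest>.*)")


def parse_jsdoc(comment: str) -> List[Dict[str, str]]:
    """Second pass: extract tags from one comment token from the first pass."""
    if not (comment.startswith("/**") and comment.endswith("*/")):
        return []  # an ordinary comment: nothing structured to parse
    tags = []
    body = comment[3:-2]  # drop the "/**" and "*/" delimiters
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("*"):
            line = line[1:].strip()  # drop the leading "*" decoration
        match = TAG_RE.match(line)
        if match:
            tags.append({
                "tag": match["tag"],
                "type": match["type"] or "",
                "text": match["rest"].strip(),
            })
    return tags


comment = """/**
 * Adds two numbers.
 * @param {number} a first operand
 * @param {number} b second operand
 * @returns {number} the sum
 */"""

for tag in parse_jsdoc(comment):
    print(tag)
```

Keeping this in a separate pass means applications that don't care about JSDoc never pay for it, and the JavaScript grammar stays free of conventions that aren't part of the language.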
I don't think you can do this without recompiling, since the grammars get translated into C code before use. But the built-in command line tools (‘tree-sitter parse’, etc) all support a mode where they will detect local changes to a checked-out grammar definition, and recompile on the fly if needed. (This happens each time the CLI program is started up; it doesn't happen during a long-running process.)
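A check-at-startup scheme like the one described can be sketched with file modification times. How the CLI actually detects changes is not stated here, so the mtime comparison is an assumption, and the `needs_rebuild` helper and the `parser.so` file name are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def needs_rebuild(sources: Iterable[Path], artifact: Path) -> bool:
    """True if the compiled artifact is missing or older than any source file."""
    if not artifact.exists():
        return True
    built_at = artifact.stat().st_mtime
    return any(src.stat().st_mtime > built_at for src in sources)


with tempfile.TemporaryDirectory() as tmp:
    grammar = Path(tmp) / "grammar.js"
    parser_lib = Path(tmp) / "parser.so"  # hypothetical name for the compiled parser
    grammar.write_text("// grammar definition")

    missing = needs_rebuild([grammar], parser_lib)  # nothing compiled yet

    parser_lib.write_text("compiled")
    # Set explicit timestamps so the demo does not depend on clock resolution.
    os.utime(grammar, (1000, 1000))
    os.utime(parser_lib, (2000, 2000))
    fresh = needs_rebuild([grammar], parser_lib)  # artifact is newer than the grammar

    os.utime(grammar, (3000, 3000))  # simulate editing the grammar afterwards
    stale = needs_rebuild([grammar], parser_lib)

print(missing, fresh, stale)
```

A check like this runs once per invocation, which matches the point that nothing is recompiled in the middle of a long-running process.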
The obvious answer is to embed TCC or another C compiler and either generate a dynamic library or generate wasm and load it directly into the process.
exec_wasm(generate_wasm(generate_c(grammar)))
Now if you can make that whole function chain incremental (delta_grammar -> delta_c -> delta_wasm -> delta_recomputed_wasm_call), the deltas will propagate all the way down to exec_wasm, and you could dynamically execute the generated code as the grammar changes.
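A very coarse version of that idea is plain memoization of each stage. In the sketch below, `generate_c` and `generate_wasm` are stand-in string transforms, not real code generators, and the grammar is treated as a set of independently compilable rules. That independence is an assumption made only to keep the example small; real parser generation is a whole-grammar computation.

```python
from typing import Callable, Dict, List

calls: List[str] = []  # records which stages actually ran


def memoize(stage: Callable[[str], str]) -> Callable[[str], str]:
    """Cache a stage by its input so unchanged inputs are never recomputed."""
    cache: Dict[str, str] = {}

    def wrapper(source: str) -> str:
        if source not in cache:
            calls.append(stage.__name__)
            cache[source] = stage(source)
        return cache[source]

    return wrapper


@memoize
def generate_c(rule: str) -> str:
    return f"/* C for {rule} */"  # stand-in for real C generation


@memoize
def generate_wasm(c_code: str) -> str:
    return f"wasm[{c_code}]"  # stand-in for compiling C to wasm


def build(grammar: Dict[str, str]) -> Dict[str, str]:
    """Run every rule through the chain; cached stages are skipped."""
    return {name: generate_wasm(generate_c(rule)) for name, rule in grammar.items()}


grammar = {"expr": "term '+' term", "term": "NUMBER"}
build(grammar)  # first build runs every stage for every rule

calls.clear()
grammar["term"] = "NUMBER | IDENT"  # edit a single rule
build(grammar)
print(calls)  # only the edited rule's chain ran again
```

This only skips work at the granularity of whole rules. Propagating finer-grained deltas through each stage, as the comment proposes, is a much harder problem.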
One day, I would love to generalize the web-based playground so that you could edit the grammars. But it's complicated, because we use C as our output language, so you would always need to recompile the C after changing the grammar.
So, I would say that it's not on our near-term roadmap.
I'm curious whether tree-sitter can handle C and C++. I think it's super difficult with metaprogramming. Without running the preprocessor, I don't think it is possible to parse C++ correctly.
We do have C and C++ grammars [1,2] but they need some love. You're right that these two languages are among the hardest to support. You could get a tree-sitter external scanner to mimic the preprocessor without too much difficulty, but you'd still run into the problem that your macro definitions might appear in another file. Parsing in general is much easier to implement and reason about if the parse result depends only on the content of the single file that you're looking at.
Thanks for building this. I had not heard of it before, but it looks great. Are there more tutorials elsewhere on the Internet you would recommend, besides what is in the documentation?
In the near future, we'll create some more GitHub-specific documentation that walks you through how to add advanced language support for any programming language on GitHub, by writing a Tree-sitter grammar, and then by writing the tree queries that are used for syntax highlighting, simple code navigation, and someday soon... precise code navigation.