Saturday, May 20, 2023

Topiary: A code formatting engine leveraging Tree-sitter

Topiary aims to be a universal formatter engine within the Tree-sitter ecosystem. Named after the art of clipping or trimming trees into fantastic shapes, it is designed for formatter authors and formatter users:

  • Authors can create a formatter for a language without having to write their own formatting engine, or even their own parser.

  • Users benefit from uniform, comparable code style, across multiple languages, with the convenience of a single formatter tool.

The core of Topiary is written in Rust, with declarative formatting rules for bundled languages written in the Tree-sitter query language. In this first release, we have concentrated on formatting OCaml code, capitalising on the OCaml expertise within the Topiary Team and that of our colleague, Nicolas Jeannerod.

All development and releases happen over in the Topiary GitHub repository.


Motivation

Coding style has historically been a matter of personal choice. This is inherently subjective, leading to bikeshedding over formatting choices, rather than meaningful discussion during review. Prescribed style guides, linters and ultimately automatic formatters — popularised by gofmt, whose developers had the insight to impose “good enough” uniform formatting on a codebase — have helped solve these issues.

This motivated research into developing a formatter for our Nickel language. However, its internal parser did not provide a syntax tree that retained enough context to allow the original program to be reconstructed after parsing. After creating a Tree-sitter grammar for Nickel, for syntax highlighting, we concluded that it would be possible to leverage Tree-sitter for formatting as well.

But why stop at Nickel? Topiary generalises this approach to any language that doesn’t employ semantic whitespace, by expressing formatting style rules in the Tree-sitter query language. (Languages with semantic whitespace require specialised formatters, such as our Haskell formatter Ormolu.) It thus aspires to be a “universal formatter engine” for such languages, enabling the fast development of formatters wherever a Tree-sitter grammar is available.

Design Principles

To that end, Topiary has been created with the following goals in mind:

  • Use Tree-sitter for parsing, to avoid writing yet another grammar for a formatter.
  • Expect idempotency. That is, formatting of already-formatted code shouldn’t change anything.
  • Bundled formatting styles should meet the following constraints:
    • Be compatible with attested formatting styles used for that language in the wild.
    • Be faithful to the author’s intent: if code has been written such that it spans multiple lines, that decision is preserved.
    • Minimise changes between commits, such that diffs focus mainly on the code that’s changed, rather than superficial artefacts.
    • Be well-tested and robust, such that they can be trusted on large projects.
  • For end users, the formatter should run efficiently and integrate with other developer tools, such as editors and language servers.
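The idempotency goal above can be stated as a simple property: formatting the output of the formatter again must yield the same text. The following is a minimal sketch of such a check, not Topiary's own test code. The names `is_idempotent`, `collapse_whitespace` and `always_indent` are hypothetical, and the two toy "formatters" exist only to show one function that satisfies the property and one that violates it.

```rust
/// Returns true if formatting already-formatted text changes nothing,
/// i.e. format(format(source)) == format(source).
fn is_idempotent<F>(format: F, source: &str) -> bool
where
    F: Fn(&str) -> String,
{
    let once = format(source);
    let twice = format(&once);
    once == twice
}

/// Hypothetical stand-in formatter: collapses every run of whitespace
/// into a single space and trims both ends.
fn collapse_whitespace(source: &str) -> String {
    source.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Deliberately broken stand-in: prepends a space on every run,
/// so each pass changes the text again.
fn always_indent(source: &str) -> String {
    format!(" {}", source)
}
```

Here `is_idempotent(collapse_whitespace, "let  x =   1")` holds, whereas `is_idempotent(always_indent, "let x = 1")` does not, because the second pass adds another leading space.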

How it Works

As long as a Tree-sitter grammar is defined for a language, Tree-sitter can parse it and build a concrete syntax tree. Tree-sitter also allows us to run queries against this tree. We can make use of these to target interesting subtrees (e.g., an if block or a loop), to which we can apply formatting rules. These cohere into a declarative definition of how that language should be formatted.

For example:

(
  [
    (infix_operator)
    "if"
    ":"
  ] @append_space
  .
  (_)
)

This will match any node that the grammar has identified as an infix_operator, or the anonymous nodes containing if or : tokens, immediately followed by any named node (represented by the (_) wildcard pattern). The query matches on subtrees of the same shape, and the annotated node within each match is “captured” with the name @append_space, which is one of many formatting rules we have defined. Our formatter runs through all matches and captures, and whenever it processes a capture called @append_space, it appends a space after the annotated node.

Before rendering the output, Topiary does some post-processing, such as squashing consecutive spaces and newlines, trimming extraneous whitespace, and ordering indentation and newline instructions consistently. This means that you can, for example, prepend and append spaces to if and true, and Topiary will still output if true with just one space between the words.
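The squashing step can be pictured with a small model. This is a simplified sketch under stated assumptions, not Topiary's actual implementation: the `Atom` type and the `squash` and `render` functions are hypothetical names, and the only rules modelled are collapsing runs of whitespace (a newline wins over a space) and trimming whitespace at either end.

```rust
/// Hypothetical output atoms: literal text, or whitespace requested by a rule.
#[derive(Debug, Clone, PartialEq)]
enum Atom {
    Text(String),
    Space,
    Newline,
}

/// Collapse each run of whitespace atoms into a single atom. A run that
/// contains a Newline becomes one Newline, otherwise one Space. Whitespace
/// before the first Text and after the last Text is dropped.
fn squash(atoms: &[Atom]) -> Vec<Atom> {
    let mut out: Vec<Atom> = Vec::new();
    // Whitespace seen since the last Text, not yet emitted.
    let mut pending: Option<Atom> = None;
    for atom in atoms {
        match atom {
            Atom::Text(_) => {
                if let Some(ws) = pending.take() {
                    // Skip leading whitespace: only emit between two Texts.
                    if !out.is_empty() {
                        out.push(ws);
                    }
                }
                out.push(atom.clone());
            }
            // A newline overrides any pending space.
            Atom::Newline => pending = Some(Atom::Newline),
            // A space never overrides whitespace that is already pending.
            Atom::Space => {
                if pending.is_none() {
                    pending = Some(Atom::Space);
                }
            }
        }
    }
    // Any whitespace still pending here is trailing, so it is discarded.
    out
}

/// Turn the atoms into the final string.
fn render(atoms: &[Atom]) -> String {
    atoms
        .iter()
        .map(|atom| match atom {
            Atom::Text(text) => text.as_str(),
            Atom::Space => " ",
            Atom::Newline => "\n",
        })
        .collect()
}
```

With this model, prepending and appending spaces to both if and true gives the atoms Space, if, Space, Space, true, Space, which `squash` reduces to if, Space, true, so `render` produces if true with a single space.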

To make this more concrete, consider the expression 1+2. This has the following syntax tree, if it’s interpreted as OCaml, where the match described by the above query is highlighted in red:

[Figure: syntax tree of 1+2, with the match highlighted]

The @append_space capture instructs Topiary to append a space after the infix_operator, rendering 1+ 2. Repeating this process for every syntactic structure we care about — making judicious generalisations wherever possible — leads us to an overall formatting style for a language.
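The effect of the capture on 1+2 can be mimicked without Tree-sitter. The sketch below is a toy model, not how Topiary evaluates queries: `Leaf` and `render_with_append_space` are hypothetical names, the tree is flattened to its leaves, and the "immediately followed by a node" condition of the query is approximated by checking that the leaf is not the last one.

```rust
/// A leaf of the syntax tree: the node kind reported by the grammar
/// and the source text it covers.
struct Leaf<'a> {
    kind: &'a str,
    text: &'a str,
}

/// Concatenate the leaves, appending a space after every leaf whose kind is
/// listed in `append_space`, provided another leaf follows it.
fn render_with_append_space(leaves: &[Leaf<'_>], append_space: &[&str]) -> String {
    let mut out = String::new();
    for (index, leaf) in leaves.iter().enumerate() {
        out.push_str(leaf.text);
        let has_successor = index + 1 < leaves.len();
        if has_successor && append_space.contains(&leaf.kind) {
            out.push(' ');
        }
    }
    out
}
```

Given the three leaves 1, + and 2, with kinds number, infix_operator and number (the number kind is assumed here for illustration), and the capture list infix_operator, if and :, the function returns 1+ 2, matching the result described above. No space is added before the operator, because nothing in the query asks for one.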

As a formatter author, defining a style for a language is just a matter of building up these queries. End users can then apply them to their codebase with Topiary, to render their code in this style.

Related Tools

Topiary is not the first tool to use Tree-sitter beyond its original scope, nor is it the first tool that attempts to be a formatter for multiple languages (e.g., Prettier). This section lists some tools that we drew inspiration from, or used during the development of Topiary.

Meta and Multi-Language Formatters

  • treefmt: A general formatter orchestrator, which unifies formatters under a common interface.
  • format-all: A formatter orchestrator for Emacs.
  • null-ls.nvim: An LSP framework for Neovim that facilitates formatter orchestration.

Getting Started

We’re really excited about Topiary and the potential it has in this space.

This first release concentrates on formatting support for OCaml, as well as simple languages, such as JSON and TOML. Experimental formatting support is also available for Nickel, Bash, Rust, and Tree-sitter’s own query language; these are under active development or serve a pedagogical end for formatter authors.

We would highly encourage you to try Topiary and invite you to check out the Topiary GitHub repository to see for yourself. Information on installing and using Topiary can be found in this repository, where we would also welcome contributions, feature requests, and bug reports.


