Sunday, September 10, 2023

Show HN: Code Indexer Loop

Code Indexer Loop

Code Indexer Loop is a Python library designed to index and retrieve code snippets.

It builds on the indexing utilities of the LlamaIndex library and on the multi-language tree-sitter library to parse code in many popular programming languages. tiktoken is used to right-size retrieval based on the number of tokens, and LangChain is used to obtain embeddings (OpenAI's text-embedding-ada-002 by default) and store them in an embedded ChromaDB vector database. watchdog keeps the index continuously updated in response to file system events.

Read the launch blog post for more details about why we've built this!

Installation:

Use pip to install Code Indexer Loop from PyPI.

pip install code-indexer-loop

Usage:

Import necessary modules:

from code_indexer_loop.api import CodeIndexer

Create a CodeIndexer object and have it watch for changes:

indexer = CodeIndexer(src_dir="path/to/code/", watch=True)

Use .query to perform a search query:

query = "pandas"
print(indexer.query(query)[0:30])

Note: make sure the OPENAI_API_KEY environment variable is set. This is needed for generating the embeddings.

You can also use indexer.query_nodes to get the nodes of a query or indexer.query_documents to receive the entire source code files.

Note that if you edit any of the source code files in src_dir, the indexer efficiently re-indexes those files using watchdog and an MD5-based caching mechanism. As a result, the embeddings are up to date every time you query the index.
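The idea behind a hash-based cache can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation: the HashCache class and its needs_reindex method are hypothetical names, and the real indexer may store and compare digests differently.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


class HashCache:
    """Hypothetical helper: remembers the last digest seen for each path."""

    def __init__(self):
        self._digests = {}  # maps str(path) -> hex digest

    def needs_reindex(self, path):
        """Return True if the file is new or its contents changed since the last call."""
        digest = file_md5(path)
        key = str(path)
        if self._digests.get(key) == digest:
            return False  # unchanged: the stored embeddings are still valid
        self._digests[key] = digest
        return True
```

A file system event handler could call needs_reindex(Path(event_path)) and skip the embedding step whenever it returns False, so that saving a file without changing its contents costs nothing.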

Examples

Check out the basic_usage notebook for a quick overview of the API.

Token limits

You can configure token limits for the chunks through the CodeIndexer constructor:

indexer = CodeIndexer(
    src_dir="path/to/code/",
    watch=True,
    target_chunk_tokens=300,
    max_chunk_tokens=1000,
    enforce_max_chunk_tokens=False,
    coalesce=50,
    token_model="gpt-4",
)

Note that you can choose whether max_chunk_tokens is enforced. If it is, the indexer raises an exception when no semantic parse respects max_chunk_tokens.

The coalesce argument sets the threshold for combining smaller chunks into single chunks, which avoids having many very small chunks. Like the other limits, coalesce is measured in tokens.
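To make the effect of coalesce concrete, here is a rough sketch of one way small chunks could be merged. It is an assumption about the general technique, not the library's code: coalesce_chunks is a hypothetical function, and count_tokens counts whitespace-separated words as a stand-in for a real tiktoken encoding.

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words, not model tokens."""
    return len(text.split())


def coalesce_chunks(chunks, coalesce):
    """Accumulate adjacent chunks until the running chunk reaches `coalesce` tokens."""
    merged = []
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if count_tokens(buffer) >= coalesce:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # A small trailing remainder joins the previous chunk when there is one.
        if merged:
            merged[-1] += buffer
        else:
            merged.append(buffer)
    return merged
```

Because chunks are only ever concatenated in order, joining the output gives back exactly the same text as joining the input.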

tree-sitter

Using tree-sitter for parsing, the chunks are broken only at valid node-level string positions in the source file. This avoids breaking up e.g. function and class definitions.
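The principle of breaking only at node boundaries can be illustrated without tree-sitter. The sketch below uses Python's built-in ast module as a stand-in, so it handles Python source only and splits only at top-level statements; split_at_top_level_nodes is a hypothetical name, and the library's tree-sitter based chunker works on finer-grained nodes and across languages.

```python
import ast


def split_at_top_level_nodes(source):
    """Split Python source so every break falls at the start of a top-level statement."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    starts = set()
    for node in tree.body:
        # A decorated def/class reports the line of `def`/`class`, so include decorators.
        decorators = getattr(node, "decorator_list", [])
        first_line = min([node.lineno] + [d.lineno for d in decorators])
        starts.add(first_line - 1)  # convert 1-based line number to 0-based index
    # The first chunk always starts at line 0 so leading comments are kept.
    bounds = sorted(starts | {0}) + [len(lines)]
    return ["".join(lines[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]
```

Every character of the input ends up in exactly one chunk, so "".join(chunks) reproduces the original source, and no function or class body is cut in half.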

Supported languages:

C, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript

Note: we're mainly testing Python support. Use other languages at your own peril.

Contributing

Pull requests are welcome. Please make sure to update tests as appropriate. Use the tools provided in the dev dependencies to maintain the code standard.

Tests

Run the unit tests by invoking pytest in the repository root.

License

Please see the LICENSE file provided with the source code.

Attribution

We'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic here and here. The implementation in code_indexer_loop is modified from their original implementation, mainly to limit chunks by tokens instead of characters and to achieve perfect document reconstruction ("".join(chunks) == original_source_code).

from Hacker News https://ift.tt/z6271kP
