Updated 5 months ago

Generic code-splitting strategy

Hi, just wanted to poll here: how much customization do people do when using a code splitter (from LlamaIndex, LangChain, or any proprietary engine)?

Thinking about this from the extraction perspective: the existing implementation drops some file-level information, such as import statements or the package name. Ideally these should be captured in each chunk's metadata.
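To make the idea concrete, here is a minimal sketch of collecting file-level imports once and attaching them to every chunk. It assumes Python source and uses the standard-library `ast` module rather than any splitter's own API. The names `file_level_metadata` and `attach_metadata` and the `{"text", "metadata"}` chunk shape are hypothetical, not part of LlamaIndex or LangChain. A package name (as in Java) would need a language-specific parser and is not covered here.

```python
from __future__ import annotations

import ast


def file_level_metadata(source: str) -> dict:
    """Collect the import statements found at the top level of a module."""
    tree = ast.parse(source)
    imports = []
    for node in tree.body:  # top-level statements only, not nested imports
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Recover the exact source text of the import statement.
            imports.append(ast.get_source_segment(source, node))
    return {"imports": imports}


def attach_metadata(chunks: list[str], source: str) -> list[dict]:
    """Pair each chunk (produced by any splitter) with the file-level metadata."""
    meta = file_level_metadata(source)
    return [{"text": chunk, "metadata": meta} for chunk in chunks]
```

The chunks themselves can still come from whatever splitter is already in use; the metadata pass runs once per file, so its cost does not grow with the number of chunks.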

Walking the AST is a chore, slows down the chunking process, and can go wrong in countless ways when some construct is unaccounted for.

Alternatively, tree-sitter query expressions can help, but they are language-specific and, in my experience, only about as powerful as regular expressions. Sometimes you need to write wrapper code to extract the surrounding block, for example to find the parent that contains a method node (a class, an interface, another method, or something else).
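For the "what contains this method" part, this is roughly the kind of wrapper code I mean. It is only a sketch: it uses Python's standard-library `ast` module instead of tree-sitter so it stays self-contained, it only handles Python source, and `enclosing_scopes` is a made-up helper name. With tree-sitter the same idea should apply by walking upward from the captured node.

```python
from __future__ import annotations

import ast


def enclosing_scopes(source: str) -> dict[str, list[str]]:
    """Map each function's dotted name to the names of its enclosing scopes."""
    tree = ast.parse(source)
    result: dict[str, list[str]] = {}

    def visit(node: ast.AST, scope: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Record the chain of parents (classes/functions) for this function.
                result[".".join(scope + [child.name])] = list(scope)
                visit(child, scope + [child.name])
            elif isinstance(child, ast.ClassDef):
                # Classes extend the scope but are not recorded themselves.
                visit(child, scope + [child.name])
            else:
                # Keep descending so functions inside if/try/with blocks are found.
                visit(child, scope)

    visit(tree, [])
    return result
```

The parent chain could then go into the chunk metadata next to the imports, so a method chunk still carries the name of the class it came from.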

Or I may be entirely wrong about my approach and there might be something far simpler and easier to pull off.