
Splitting

Is it possible to use LlamaIndex to split a large document into just an array of text, but with chunk sizes of 1000, 2000, 4000, and 8000 tokens?
For sure. You could use a TokenTextSplitter and call splitter.split_text().

By default it uses a GPT-2 tokenizer for counting tokens, but you can pass in any tokenizer function as a kwarg when creating the object.
Plain Text
import tiktoken
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser

# Token-based splitter; chunk_size is counted with the gpt-3.5-turbo
# tokenizer here instead of the default GPT-2 tokenizer
text_splitter = TokenTextSplitter(
    separator="\n\n",
    chunk_size=4000,
    chunk_overlap=200,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser(text_splitter=text_splitter)

# Parse the loaded documents into nodes, then collect the raw text of each node
nodes = node_parser.get_nodes_from_documents(documents, show_progress=False)

array_of_text = [node.text for node in nodes]
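
If you want all four sizes from the question, one option is to skip the node parser and call split_text() directly once per chunk size. This is a minimal sketch, assuming documents is the list of loaded Document objects and you only need plain strings back; the chunks_by_size name and the choice to join the documents into one string are just illustrative.
Plain Text
import tiktoken
from llama_index.text_splitter import TokenTextSplitter

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode

# Join the loaded documents into one string to split (illustrative assumption)
full_text = "\n\n".join(doc.text for doc in documents)

# Build one array of text chunks per target chunk size
chunks_by_size = {}
for chunk_size in (1000, 2000, 4000, 8000):
    splitter = TokenTextSplitter(
        separator="\n\n",
        chunk_size=chunk_size,
        chunk_overlap=200,
        backup_separators=["\n"],
        tokenizer=tokenizer,
    )
    # split_text() returns a plain list of strings, no node objects involved
    chunks_by_size[chunk_size] = splitter.split_text(full_text)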

thank you