
Splitting

At a glance
The community member asked whether it is possible to use LlamaIndex to split a large document into an array of text based on specific token counts (1000, 2000, 4000, and 8000). Another community member responded that this can be done with a TokenTextSplitter and provided example code that splits the text into chunks of 4000 tokens with a 200-token overlap, using the GPT-3.5-Turbo tokenizer.
Is it possible to use llama index to split a large document into just an array of text, but based on 1000, 2000, 4000, and 8000 tokens?
2 comments
For sure. You could use a TokenTextSplitter and call splitter.split_text().

By default it uses a GPT-2 tokenizer for counting tokens, but you can pass any tokenizer function as a kwarg when creating the object:
Plain Text
import tiktoken

# Legacy llama_index import paths; newer releases moved these modules
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

# Count tokens with the GPT-3.5-Turbo tokenizer instead of the default
text_splitter = TokenTextSplitter(
    separator="\n\n",
    chunk_size=4000,
    chunk_overlap=200,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser(text_splitter=text_splitter)

# `documents` is the list of Document objects you loaded elsewhere
nodes = node_parser.get_nodes_from_documents(documents, show_progress=False)

# Keep only the chunk text as a plain list of strings
array_of_text = [node.text for node in nodes]

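If you just need raw strings at each of the sizes mentioned in the question (1000, 2000, 4000, and 8000 tokens), a minimal sketch along the same lines could call splitter.split_text() once per size. The joined `text` variable and the `chunks_by_size` dict below are illustrative names, not from the original post:
Plain Text
import tiktoken

from llama_index.text_splitter import TokenTextSplitter

# Assumes `documents` is the same list of Document objects as above
text = "\n\n".join(doc.text for doc in documents)
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode

# One list of text chunks per requested chunk size
chunks_by_size = {}
for chunk_size in (1000, 2000, 4000, 8000):
    splitter = TokenTextSplitter(
        separator="\n\n",
        chunk_size=chunk_size,
        chunk_overlap=200,
        backup_separators=["\n"],
        tokenizer=tokenizer,
    )
    chunks_by_size[chunk_size] = splitter.split_text(text)

Each value in chunks_by_size is then a plain array of strings, with each string staying within that token budget.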
thank you