
Splitting

At a glance
The community member asked whether it is possible to use LlamaIndex to split a large document into an array of text based on specific token counts (1000, 2000, 4000, and 8000). Another community member responded that this can be done with a TokenTextSplitter and provided example code that splits the text into chunks of 4000 tokens with a 200-token overlap, using the GPT-3.5-Turbo tokenizer.
Is it possible to use llama index to split a large document into just an array of text, but based on 1000, 2000, 4000, and 8000 tokens?
2 comments
For sure. You could use a TokenTextSplitter and call splitter.split_text().

By default it uses a GPT-2 tokenizer for counting tokens, but you can pass any tokenizer function as a kwarg when creating the object:
Plain Text
import tiktoken

# Legacy llama_index import paths; newer releases moved these modules
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

# Count tokens with the GPT-3.5-Turbo tokenizer instead of the default
text_splitter = TokenTextSplitter(
    separator="\n\n",
    chunk_size=4000,
    chunk_overlap=200,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser(text_splitter=text_splitter)

# `documents` is the list of Document objects you loaded elsewhere
nodes = node_parser.get_nodes_from_documents(documents, show_progress=False)

# Keep only the chunk text as a plain list of strings
array_of_text = [node.text for node in nodes]

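If you just need raw strings at each of the sizes mentioned in the question (1000, 2000, 4000, and 8000 tokens), a minimal sketch along the same lines could call splitter.split_text() once per size. The joined `text` variable and the `chunks_by_size` dict below are illustrative names, not from the original post:
Plain Text
import tiktoken

from llama_index.text_splitter import TokenTextSplitter

# Assumes `documents` is the same list of Document objects as above
text = "\n\n".join(doc.text for doc in documents)
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode

# One list of text chunks per requested chunk size
chunks_by_size = {}
for chunk_size in (1000, 2000, 4000, 8000):
    splitter = TokenTextSplitter(
        separator="\n\n",
        chunk_size=chunk_size,
        chunk_overlap=200,
        backup_separators=["\n"],
        tokenizer=tokenizer,
    )
    chunks_by_size[chunk_size] = splitter.split_text(text)

Each value in chunks_by_size is then a plain array of strings, with each string staying within that token budget.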
thank you