
Updated 3 months ago

hey all having an issue with pickling

hey all - having an issue with pickling objects in what seems to be a relatively simple scenario. here's the code:
from llama_index import VectorStoreIndex
from llama_index.schema import Document
import os
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

class NewlineTextSplitter(TokenTextSplitter):
    def split_text(self, text):
        # Split the text into chunks based on newlines
        chunks = text.split('\n\n')

        return chunks

class CharacterSheetIndexer:
    def __init__(self, character_sheets_dir):
        self.character_sheets_dir = character_sheets_dir
        self.indexes = {}

    def create_indexes(self):
        # Create a NodeParser that uses NewlineTextSplitter
        node_parser = SimpleNodeParser(text_splitter=NewlineTextSplitter())

        # Read all character sheets
        for filename in os.listdir(self.character_sheets_dir):
            with open(os.path.join(self.character_sheets_dir, filename), 'r') as f:
                character_sheet = f.read()

            # Create a document from the character sheet
            document = Document(text=character_sheet)

            # Chunk the document into nodes using the NodeParser
            nodes = node_parser.get_nodes_from_documents([document])

            # Create an index for the character sheet from the parsed nodes
            self.indexes[filename] = VectorStoreIndex(nodes)

def main():
    # Create an instance of CharacterSheetIndexer
    indexer = CharacterSheetIndexer(character_sheets_dir='resources/characters')

    # Index all character sheets
    indexer.create_indexes()

if __name__ == "__main__":
    main()
which object are you pickling here?
that's the thing... i'm not pickling anything heh
full stack trace:
i started a fresh project with the same code and nothing else so i'm confident it's not related to the rest of my project. i used poetry to install the packages. here's the pyproject.toml file:
[tool.poetry]
name = "test-llamaindex"
version = "0.1.0"
description = ""
authors = ["kavinstewart"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.11"
llama-index = "^0.8.29.post1"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
i'm using cursor and this is what it claims is going on:

It seems like the issue is caused by the youtokentome package creating an object of type CoreBPE that can't be pickled. This is likely happening when the tiktoken.get_encoding function is called in the TokenTextSplitter class. A potential solution could be to modify the NewlineTextSplitter class to avoid creating a CoreBPE object. However, this would require understanding the youtokentome package's code and potentially making changes to it. Another solution could be to avoid creating a deepcopy of the TokenTextSplitter object, but this would require changing the pydantic library's code. I recommend looking into these potential solutions and seeing which one is most feasible for your situation.

thoughts?
hmmm, let me see if I can reproduce the issue and debug from there
here's the full analysis in case useful. (i'm going to take a closer look at this suggestion):
I think the issue is the class you are extending has a tokenizer attribute and is also a pydantic object

And something about extending the class is causing it to pickle/copy attributes, and fails when it hits this...

Trying to figure out a workaround -- very annoying behaviour from pydantic tho
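For anyone wondering how a pickle error can show up when nothing calls pickle: `copy.deepcopy` falls back on the same reduce protocol that pickle uses, so copying an object that holds an unpicklable attribute raises a "cannot pickle" error. Here's a rough stdlib-only sketch of that behaviour. `FakeTokenizer` and `Splitter` are made-up stand-ins (not llama_index or tiktoken classes), and a `threading.Lock` plays the role of the native tokenizer object:

```python
import copy
import threading


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer that wraps a native handle."""

    def __init__(self):
        # A lock cannot be pickled, similar to the CoreBPE object in the report.
        self._handle = threading.Lock()


class Splitter:
    """Hypothetical splitter that stores the tokenizer as a regular attribute."""

    def __init__(self):
        self.tokenizer = FakeTokenizer()


def try_deepcopy(obj):
    """Return the error message raised by deepcopy, or None if the copy works."""
    try:
        copy.deepcopy(obj)
    except TypeError as exc:
        return str(exc)
    return None


# No pickle call anywhere, yet the failure mentions pickling.
message = try_deepcopy(Splitter())
print(message)
```

So any library code that deep-copies the object (or its defaults) can surface this error even if your own code never pickles anything.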
I think I have a fix for it, but it requires a change to the library code
If the tokenizer is a PrivateAttr instead of a Field, then it works fine
ok i'll look for a workaround then
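One possible workaround, assuming the only goal is to chunk on blank lines: skip subclassing `TokenTextSplitter` and do the split in a plain function, so no tokenizer attribute is involved at all. This is just a sketch; `split_on_blank_lines` is a hypothetical helper name, not part of llama_index. Unlike a bare `text.split('\n\n')`, it also drops empty chunks and trims stray whitespace:

```python
import re


def split_on_blank_lines(text):
    """Split text into chunks separated by one or more blank lines."""
    # A blank line is a newline, optional whitespace, then another newline.
    raw_chunks = re.split(r"\n\s*\n", text)
    # Trim each chunk and drop the ones that are empty after trimming.
    return [chunk.strip() for chunk in raw_chunks if chunk.strip()]


sample = "Name: Aria\nClass: Ranger\n\n\nBackstory: raised in the north.\n"
chunks = split_on_blank_lines(sample)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk could then be wrapped in a node by hand (for example `TextNode(text=chunk)` from `llama_index.schema`, if that fits the installed version) and passed to the index, instead of going through a custom splitter class.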
tbh the splitter should be easier to override too though -- I'll double check if changing that to a private attr breaks things anywhere else, and make a PR if not
what do most people do? just use the normal splitter?
or use the SentenceSplitter?
The default is sentence splitter -- most people seem to use that.

Although if your data isn't very "sentence" based, you might want to use the token splitter
gotcha thx!