hey all having an issue with pickling

hey all - having an issue with pickling objects in what seems to be a relatively simple scenario. here's the code:
Python
from llama_index import VectorStoreIndex
from llama_index.schema import Document
import os
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

class NewlineTextSplitter(TokenTextSplitter):
    def split_text(self, text):
        # Split the text into chunks based on newlines
        chunks = text.split('\n\n')

        return chunks

class CharacterSheetIndexer:
    def __init__(self, character_sheets_dir):
        self.character_sheets_dir = character_sheets_dir
        self.indexes = {}

    def create_indexes(self):
        # Create a NodeParser that uses NewlineTextSplitter
        node_parser = SimpleNodeParser(text_splitter=NewlineTextSplitter())

        # Read all character sheets
        for filename in os.listdir(self.character_sheets_dir):
            with open(os.path.join(self.character_sheets_dir, filename), 'r') as f:
                character_sheet = f.read()

            # Create a document from the character sheet
            document = Document(text=character_sheet)

            # Chunk the document into nodes using the NodeParser
            nodes = node_parser.get_nodes_from_documents([document])

            # Build an index for the character sheet from the parsed nodes
            self.indexes[filename] = VectorStoreIndex(nodes)

def main():
    # Create an instance of CharacterSheetIndexer
    indexer = CharacterSheetIndexer(character_sheets_dir='resources/characters')

    # Index all character sheets
    indexer.create_indexes()

if __name__ == "__main__":
    main()
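For reference, the custom splitter in the snippet just breaks text on blank lines, so each blank-line-separated block becomes one chunk. A minimal standalone sketch of that behavior (the sample text is made up for illustration):

```python
# What NewlineTextSplitter.split_text does: split on double newlines,
# yielding one chunk per paragraph-style block.
text = "Name: Alice\nClass: Wizard\n\nName: Bob\nClass: Rogue"

chunks = text.split('\n\n')
print(len(chunks))  # 2 -- one chunk per character block
```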
21 comments
which object are you pickling here?
that's the thing... i'm not pickling anything heh
full stack trace:
i started a fresh project with the same code and nothing else so i'm confident it's not related to the rest of my project. i used poetry to install the packages. here's the pyproject.toml file:
TOML
[tool.poetry]
name = "test-llamaindex"
version = "0.1.0"
description = ""
authors = ["kavinstewart"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.11"
llama-index = "^0.8.29.post1"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
i'm using cursor and this is what it claims is going on:

It seems like the issue is caused by the youtokentome package creating an object of type CoreBPE that can't be pickled. This is likely happening when the tiktoken.get_encoding function is called in the TokenTextSplitter class. A potential solution could be to modify the NewlineTextSplitter class to avoid creating a CoreBPE object. However, this would require understanding the youtokentome package's code and potentially making changes to it. Another solution could be to avoid creating a deepcopy of the TokenTextSplitter object, but this would require changing the pydantic library's code. I recommend looking into these potential solutions and seeing which one is most feasible for your situation.

thoughts?
hmmm, let me see if I can reproduce the issue and debug from there
here's the full analysis in case useful. (i'm going to take a closer look at this suggestion):
I think the issue is the class you are extending has a tokenizer attribute and is also a pydantic object

And something about extending the class is causing it to pickle/copy attributes, and fails when it hits this...

Trying to figure out a workaround -- very annoying behaviour from pydantic tho
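A pydantic-free sketch of the failure mode described above. This is not llama_index's actual code: a `threading.Lock` stands in for tiktoken's `CoreBPE`, since both are objects that can't be reduced for copying:

```python
import copy
import threading

class Tokenizer:
    """Stand-in for the subclassed splitter: holds an attribute that,
    like tiktoken's CoreBPE, cannot be pickled or deep-copied."""
    def __init__(self):
        self._bpe = threading.Lock()  # un-reducible, like CoreBPE

splitter = Tokenizer()

# deepcopy falls back to __reduce_ex__ for types it doesn't know how
# to copy, so the error surfaces as a *pickling* failure even though
# nothing ever called pickle.dumps() directly -- which is why the
# asker sees a pickle error without pickling anything.
try:
    copy.deepcopy(splitter)
except TypeError as e:
    print(e)  # cannot pickle '_thread.lock' object
```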
I think I have a fix for it, but it requires a change to the library code :PSadge:
If the tokenizer is a PrivateAttr instead of a Field, then it works fine πŸ€”
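Until a library-side fix lands, one plain-Python analogue of that idea is to keep the unpicklable helper out of the copy path entirely by rebuilding it on deepcopy instead of copying it. This is a sketch of the pattern, not llama_index's or pydantic's actual code; `threading.Lock` again stands in for the tokenizer:

```python
import copy
import threading

class SafeSplitter:
    """Holds an unpicklable helper but recreates it on deepcopy
    rather than copying it -- roughly the effect PrivateAttr has,
    since the attribute stays out of the field-copying machinery."""
    def __init__(self):
        self._tokenizer = threading.Lock()  # stand-in for CoreBPE

    def __deepcopy__(self, memo):
        # Build a fresh instance (and a fresh helper) instead of
        # trying to reduce the unpicklable attribute.
        return SafeSplitter()

copy.deepcopy(SafeSplitter())  # no longer raises
```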
ok i'll look for a workaround then
tbh the splitter should be easier to override too though -- I'll double check if changing that to a private attr breaks things anywhere else, and make a PR if not
what do most people do? just use the normal splitter?
or use the SentenceSplitter?
The default is sentence splitter -- most people seem to use that.

Although if your data isn't very "sentence" based, you might want to use the token splitter
gotcha thx!