@Logan M @kapa.ai something strange is happening when I run my documents through the pipeline I get this error:
Plain Text
    nodes = await pipeline.arun(documents=processed_documents)
    pipeline.persist(persist_dir=pipeline_cache)


Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/slowfs/amsgcn4/home/kleinada/ds_work/snps-edag-cc-dataset-gen/venv/lib/python3.11/site-packages/llama_index/core/ingestion/pipeline.py", line 319, in persist
self.docstore.persist(docstore_path, fs=fs)
File "/slowfs/amsgcn4/home/kleinada/ds_work/snps-edag-cc-dataset-gen/venv/lib/python3.11/site-packages/llama_index/core/storage/docstore/simple_docstore.py", line 84, in persist
self._kvstore.persist(persist_path, fs=fs)
File "/slowfs/amsgcn4/home/kleinada/ds_work/snps-edag-cc-dataset-gen/venv/lib/python3.11/site-packages/llama_index/core/storage/kvstore/simple_kvstore.py", line 88, in persist
f.write(json.dumps(self._data))
^^^^^^^^^^^^^^^^^^^^^^
File "/u/kleinada/.pyenv/versions/3.11.9/lib/python3.11/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/u/kleinada/.pyenv/versions/3.11.9/lib/python3.11/json/encoder.py", line 200, in encode
chunks = self.iterencode(o, _one_shot=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/u/kleinada/.pyenv/versions/3.11.9/lib/python3.11/json/encoder.py", line 258, in iterencode
return _iterencode(o, 0)
^^^^^^^^^^^^^^^^^
File "/u/kleinada/.pyenv/versions/3.11.9/lib/python3.11/json/encoder.py", line 180, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type PosixPath is not JSON serializable
12 comments
Please stand by @cmosguy as I search the knowledge base. I normally take about 15 seconds or so. :wave:
----------------------
This error is related to JSON serialization when trying to persist the pipeline. The issue occurs because the pipeline is trying to serialize a PosixPath object, which is not JSON serializable by default. This typically happens when file paths are involved in the data being persisted.

To resolve this issue, you can try the following approach:

  1. Convert the PosixPath objects to strings before persisting the pipeline.
  2. Use a custom JSON encoder that can handle PosixPath objects.
Here's a solution you can try:

Plain Text
import json
from pathlib import Path

class PathEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Path):
            return str(obj)
        return super().default(obj)

# After running your pipeline
nodes = await pipeline.arun(documents=processed_documents)

# When persisting, use the custom encoder
pipeline.persist(
    persist_dir=pipeline_cache,
    custom_persist_kwargs={"encoder": PathEncoder()}
)


This solution adds a custom JSON encoder that converts Path objects to strings. By passing this encoder to the persist method, you should be able to avoid the serialization error.

If this doesn't solve the issue, it's possible that the PosixPath object is nested deeper in your data structure. In that case, you might need to recursively convert all Path objects to strings before persisting.
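The recursive conversion mentioned above could be sketched like this (a standalone helper, not part of the LlamaIndex API):

```python
from pathlib import Path


def stringify_paths(obj):
    """Recursively convert Path objects to strings in nested dicts/lists."""
    if isinstance(obj, Path):
        return str(obj)
    if isinstance(obj, dict):
        return {k: stringify_paths(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [stringify_paths(v) for v in obj]
    return obj
```

Running this over each document's metadata before ingestion would leave only JSON-serializable values behind.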

For more information on persisting pipelines and handling serialization issues, you can refer to the LlamaIndex documentation on IngestionPipeline.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai there is no such thing as custom_persist_kwargs for persist method
Let me check the knowledge base for you again. One moment! :books:
----------------------
I apologize for the confusion in my previous response. You are correct, and I made a mistake in suggesting a non-existent parameter. Let's review the correct information about the persist method for the IngestionPipeline class.

According to the LlamaIndex documentation for IngestionPipeline.persist, the persist method has the following signature:

Plain Text
persist(persist_dir: str = './pipeline_storage', fs: Optional[AbstractFileSystem] = None, cache_name: str = DEFAULT_CACHE_NAME, docstore_name: str = DOCSTORE_FNAME) -> None


There is indeed no custom_persist_kwargs parameter. I apologize for providing incorrect information.

Given this, the issue with serializing PosixPath objects is likely occurring at a lower level, possibly in the docstore or cache serialization. To address this, you might need to modify how paths are stored in your data before persisting.

Here are a few suggestions to try:

  1. Convert any Path objects to strings before adding them to your documents or nodes.
  2. If possible, use relative paths instead of absolute paths in your data.
  3. You could try using a different filesystem implementation that handles path serialization, by passing it to the fs parameter. For example:
Plain Text
import fsspec

fs = fsspec.filesystem('file')
pipeline.persist(persist_dir=pipeline_cache, fs=fs)


If these suggestions don't resolve the issue, you may need to examine where exactly the PosixPath objects are being introduced into your data and handle them before they reach the persistence layer.
I apologize again for the earlier misinformation, and thank you for pointing out the error. If you need further assistance with this issue, please provide more details about your data structure and how you're creating and using the pipeline.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
seems like your document objects have a Path object in their metadata for some reason, you'd need to remove/cast to string
yeah, some of the objects are ImageDocuments. Do those have that in the metadata?
maybe? I would just iterate over the documents + metadata and ensure there are no Path objects left
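That iteration could be sketched like this (assuming each document exposes its metadata as a plain dict attribute; the helper name is illustrative):

```python
from pathlib import Path


def sanitize_metadata(documents):
    """Cast any Path values in each document's metadata to plain strings."""
    for doc in documents:
        doc.metadata = {
            k: str(v) if isinstance(v, Path) else v
            for k, v in doc.metadata.items()
        }
    return documents
```

Calling this on the document list before `pipeline.arun(...)` would keep Path objects from ever reaching the docstore.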
yeah, I have to double check that and see what happens. I think this is a bug maybe with the pipeline persist
persist can only persist serializable objects. So if it's not serializable, it can't force it
not really a bug, just an input issue
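The point above can be reproduced in isolation: as the traceback shows, the docstore ultimately calls `json.dumps` on its data, and `json.dumps` has no default handling for Path objects (the exact class name in the error depends on the OS):

```python
import json
from pathlib import PurePosixPath

# json.dumps rejects any type it has no encoder for; a bare Path value
# in document metadata triggers exactly this TypeError during persist.
try:
    json.dumps({"file_path": PurePosixPath("docs/report.pdf")})
except TypeError as exc:
    print(exc)  # e.g. "Object of type PurePosixPath is not JSON serializable"
```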