Hi LlamaIndex Community,

I'm currently working on integrating LlamaIndex with Qdrant for a project. I've encountered an issue where my data doesn't appear in the specified Qdrant collection after running it through the ingestion pipeline. I've confirmed that the collection exists and that there are no errors in the logs. The HTTP response from Qdrant is 200, indicating successful communication.

Here's a brief overview of what I'm doing:

  1. I've set up an ingestion pipeline using LlamaIndex, which processes documents and is supposed to index them into Qdrant.
  2. The collection in Qdrant is already created, and the environment variables for QDRANT_API_KEY and QDRANT_URL are correctly set.
  3. The logs show successful processing of documents and chunks being added, but when I check the collection in Qdrant, it's empty.

I've double-checked the collection name and ensured the QdrantVectorStore is correctly configured. There are no errors in the debug logs, and the process finishes with exit code 0, suggesting that the script completes successfully.
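
For reference, here's a stripped-down sketch of the setup (the reader, client configuration, and collection name below are simplified placeholders, not the exact code):

Plain Text
import os

import qdrant_client
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to the existing Qdrant collection using the environment variables.
client = qdrant_client.QdrantClient(
    url=os.environ['QDRANT_URL'],
    api_key=os.environ['QDRANT_API_KEY'],
)
vector_store = QdrantVectorStore(client=client, collection_name='my_collection')

# Ingestion pipeline: split documents into chunks and write them to Qdrant.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=256, chunk_overlap=0),
    ],
    vector_store=vector_store,
)
pipeline.run(documents=SimpleDirectoryReader('./data').load_data())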

Could I be missing something in the setup or a step I've overlooked that's preventing the data from being indexed in Qdrant? Any insights or suggestions would be greatly appreciated.

Thank you in advance for your help!
Code looks right to me too πŸ‘€
@alex-feel you need to include an embedding model in your transformations list
Plain Text
transformations=[
    SentenceSplitter(chunk_size=256, chunk_overlap=0),
    embed_model
],
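
Here embed_model is whatever embedding model you're using elsewhere in the script; any embedding instance works, e.g. (assuming the OpenAI embeddings integration is installed, purely as an illustration):

Plain Text
from llama_index.embeddings.openai import OpenAIEmbedding

# Any embedding model works here; OpenAIEmbedding is just an illustration.
embed_model = OpenAIEmbedding()
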
Thank you, @Logan M, but it didn't work πŸ€·β€β™‚οΈ

This:

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
        ),
        embed_model
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)


throws an error:

Plain Text
Traceback (most recent call last):
  File "C:\Users\User\Projects\project\TEST_qdrant_ingest_data.py", line 80, in <module>
    SemanticSplitterNodeParser(
  File "C:\Users\User\Projects\project\venv\lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for SemanticSplitterNodeParser
embed_model
  field required (type=value_error.missing)


If I add an embedding model to the splitter itself, it just doesn't work (please see the attached files).
Attachment: 2024-02-22_18-53-28.jpg
@Logan M, I found that it works for SentenceSplitter(chunk_size=256, chunk_overlap=0) πŸ‘ But it doesn't work for SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95). Do you know what the reason could be?
Additionally, I found that document management doesn't work as described in the documentation (https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html). After every run, the initial number of vectors is added to the total.
You need to persist the docstore somewhere in order to have de-duplication. Otherwise it's not remembering what has been ingested
I'm a little confused about the semantic splitter

Plain Text
SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
),
Providing the embed model didn't help?
Yes, it works:

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)


Thank you! Too many embeddings, I need to read the documentation more carefully πŸ˜…
I wanted to add to my previous message about document management not working as described: I checked with Pinecone and got the same result. The initial number of vectors is 76, and after each run (without any changes to the files or the code), it consistently adds another 76 vectors. Should I create an issue on GitHub regarding this problem?
I've just noticed your comment about the necessity to persist the docstore for de-duplication purposes, as it's crucial for it to remember what has been ingested. I missed this detail earlier. I'll check it right now and implement the necessary adjustments. Thank you for pointing this out!
I've tried various ways to use persistent storage, but nothing seems to work. Specifically, after the first run of this code (my latest attempt):

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    cache=IngestionCache(),
)

pipeline.persist('./pipeline_storage')

pipeline.run(
    documents=reader.load_data(),
)


in the ./pipeline_storage directory, two files, docstore.json and llama_cache, appear, but both contain only an empty object {}. Could there be something I'm missing in the setup, or in the way persistence is supposed to work?
In the log, it's evident that the files are being opened:

Plain Text
DEBUG:fsspec.local:open file: C:/Users/User/Projects/project/pipeline_storage/llama_cache
DEBUG:fsspec.local:open file: C:/Users/User/Projects/project/pipeline_storage/docstore.json


However, ultimately, they remain empty.
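
Worth noting: in the snippet above, persist() is called before run(), so the docstore is written to disk while it is still empty, and run() doesn't write it again afterwards. The linked document-management example does it the other way around, roughly like this (a sketch; the path is illustrative):

Plain Text
# First run: ingest, then save the docstore (and cache) state to disk.
pipeline.run(documents=reader.load_data())
pipeline.persist('./pipeline_storage')

# Later runs: restore the saved state first so already-ingested documents are skipped.
pipeline.load('./pipeline_storage')
pipeline.run(documents=reader.load_data())
pipeline.persist('./pipeline_storage')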
I tried using Redis for storing the docstore and cache as described here https://docs.llamaindex.ai/en/stable/examples/ingestion/redis_ingestion_pipeline.html:

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        'localhost', 6379, namespace='document_store'
    ),
    vector_store=vector_store,
    cache=IngestionCache(
        cache=RedisCache.from_host_and_port('localhost', 6379),
        collection='redis_cache',
    ),
)


However, in cache.py (llama_index.core.ingestion.cache), there seems to be no reference to RedisCache. I'm missing something...
ah, the cache thing is not needed (and previously RedisCache was just an alias for RedisKVStore lol, need to add that back)

I'm surprised the first code block didn't work? let me quickly confirm
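
For reference, dropping the cache argument from the Redis snippet above leaves something like this (an untested sketch):

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    # The persistent docstore is what enables de-duplication across runs;
    # the ingestion cache is optional and only skips re-running transformations.
    docstore=RedisDocumentStore.from_host_and_port(
        'localhost', 6379, namespace='document_store'
    ),
    vector_store=vector_store,
)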