Hi LlamaIndex Community,

I'm currently working on integrating LlamaIndex with Qdrant for a project. I've encountered an issue where my data doesn't appear in the specified Qdrant collection after running it through the ingestion pipeline. I've confirmed that the collection exists and that there are no errors in the logs. The HTTP response from Qdrant is 200, indicating successful communication.

Here's a brief overview of what I'm doing:

  1. I've set up an ingestion pipeline using LlamaIndex, which processes documents and is supposed to index them into Qdrant.
  2. The collection in Qdrant is already created, and the environment variables for QDRANT_API_KEY and QDRANT_URL are correctly set.
  3. The logs show successful processing of documents and chunks being added, but when I check the collection in Qdrant, it's empty.

I've double-checked the collection name and ensured the QdrantVectorStore is correctly configured. There are no errors in the debug logs, and the process finishes with exit code 0, suggesting that the script completes successfully.
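
For reference, here's a stripped-down sketch of the setup (the reader, client configuration, and collection name below are simplified placeholders, not the exact code):

Plain Text
import os

import qdrant_client
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to the existing Qdrant collection using the environment variables.
client = qdrant_client.QdrantClient(
    url=os.environ['QDRANT_URL'],
    api_key=os.environ['QDRANT_API_KEY'],
)
vector_store = QdrantVectorStore(client=client, collection_name='my_collection')

# Ingestion pipeline: split documents into chunks and write them to Qdrant.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=256, chunk_overlap=0),
    ],
    vector_store=vector_store,
)
pipeline.run(documents=SimpleDirectoryReader('./data').load_data())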

Could I be missing something in the setup or a step I've overlooked that's preventing the data from being indexed in Qdrant? Any insights or suggestions would be greatly appreciated.

Thank you in advance for your help!
Code looks right to me too πŸ‘€
@alex-feel you need to include an embedding model in your transformations list
Plain Text
transformations=[
    SentenceSplitter(chunk_size=256, chunk_overlap=0),
    embed_model
],
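
Here embed_model is whatever embedding model you're using elsewhere in the script; any embedding instance works, e.g. (assuming the OpenAI embeddings integration is installed, purely as an illustration):

Plain Text
from llama_index.embeddings.openai import OpenAIEmbedding

# Any embedding model works here; OpenAIEmbedding is just an illustration.
embed_model = OpenAIEmbedding()
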
Thank you, @Logan M, but it didn't work πŸ€·β€β™‚οΈ

This:

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
        ),
        embed_model
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)


throws an error:

Plain Text
Traceback (most recent call last):
  File "C:\Users\User\Projects\project\TEST_qdrant_ingest_data.py", line 80, in <module>
    SemanticSplitterNodeParser(
  File "C:\Users\User\Projects\project\venv\lib\site-packages\pydantic\v1\main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for SemanticSplitterNodeParser
embed_model
  field required (type=value_error.missing)


If I add an embedding model to the splitter itself, it just doesn't work (please see the attached files).
Attachment: 2024-02-22_18-53-28.jpg
@Logan M, I found that it works for SentenceSplitter(chunk_size=256, chunk_overlap=0) πŸ‘ But it doesn't work for SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95). Do you know what the reason could be?
Additionally, I found that document management doesn't work as described in the documentation (https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html). After every run, the initial number of vectors is added to the total.
You need to persist the docstore somewhere in order to have de-duplication. Otherwise it's not remembering what has been ingested
I'm a little confused about the semantic splitter

Plain Text
SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model
),
Providing the embed model didn't help?
Yes, it works:

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
)


Thank you! Too many embeddings, I need to read the documentation more carefully πŸ˜…
I wanted to add to my previous message about document management not working as described: I checked with Pinecone and got the same result. The initial number of vectors is 76, and after each run (without any changes to the files or the code), it consistently adds another 76 vectors. Should I create an issue on GitHub regarding this problem?
I've just noticed your comment about the necessity to persist the docstore for de-duplication purposes, as it's crucial for it to remember what has been ingested. I missed this detail earlier. I'll check it right now and implement the necessary adjustments. Thank you for pointing this out!
I've tried various ways to use persistent storage, but nothing seems to work. Specifically, after the first run of this code (my latest attempt):

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    cache=IngestionCache(),
)

pipeline.persist('./pipeline_storage')

pipeline.run(
    documents=reader.load_data(),
)


in the ./pipeline_storage directory, two files, docstore.json and llama_cache, appear, but both contain only an empty object {}. Could there be something I'm missing in the setup, or in the way persistence is supposed to work?
In the log, it's evident that the files are being opened:

Plain Text
DEBUG:fsspec.local:open file: C:/Users/User/Projects/project/pipeline_storage/llama_cache
DEBUG:fsspec.local:open file: C:/Users/User/Projects/project/pipeline_storage/docstore.json


However, ultimately, they remain empty.
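
Worth noting: in the snippet above, persist() is called before run(), so the docstore is written to disk while it is still empty, and run() doesn't write it again afterwards. The linked document-management example does it the other way around, roughly like this (a sketch; the path is illustrative):

Plain Text
# First run: ingest, then save the docstore (and cache) state to disk.
pipeline.run(documents=reader.load_data())
pipeline.persist('./pipeline_storage')

# Later runs: restore the saved state first so already-ingested documents are skipped.
pipeline.load('./pipeline_storage')
pipeline.run(documents=reader.load_data())
pipeline.persist('./pipeline_storage')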
I tried using Redis for storing the docstore and cache as described here https://docs.llamaindex.ai/en/stable/examples/ingestion/redis_ingestion_pipeline.html:

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    docstore=RedisDocumentStore.from_host_and_port(
        'localhost', 6379, namespace='document_store'
    ),
    vector_store=vector_store,
    cache=IngestionCache(
        cache=RedisCache.from_host_and_port('localhost', 6379),
        collection='redis_cache',
    ),
)


However, in cache.py (llama_index.core.ingestion.cache), there seems to be no reference to RedisCache. I'm missing something...
ah, the cache thing is not needed (and previously RedisCache was just an alias for RedisKVStore lol, need to add that back)

I'm surprised the first code block didn't work? let me quickly confirm
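
For reference, dropping the cache argument from the Redis snippet above leaves something like this (an untested sketch):

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        embed_model,
    ],
    # The persistent docstore is what enables de-duplication across runs;
    # the ingestion cache is optional and only skips re-running transformations.
    docstore=RedisDocumentStore.from_host_and_port(
        'localhost', 6379, namespace='document_store'
    ),
    vector_store=vector_store,
)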