I try to run an IngestionPipeline but the data does not show up in the vector store

At a glance

The community member is trying to run an ingestion pipeline, but the data does not show up in the vector store. The first reply points out that the pipeline has no embedding model. After adding one, the community member still does not see a collection, and eventually realizes that the Qdrant dashboard does not refresh its collection list in real time: the page has to be reloaded. The root cause was the embedding model missing from the pipeline. The community members conclude that it was a learning moment.

oosiworx

I try to run an IngestionPipeline, but it seems I am missing something, as the data does not show up in the vector store.

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=100, chunk_overlap=10),
        # TitleExtractor(),  # needs an LLM
    ],
    vector_store=vector_store,
)

pipeline.run(documents=docs)

13 comments

WWhiteFang_Jr

The only thing missing here is the embedding model. Have you defined it with Settings?

WWhiteFang_Jr

Are you getting any error?

oosiworx

I added the embedding to the vector_store and thought that was it.

oosiworx

No error. Just at the very end it tells me the server had closed the connection, but when I set a breakpoint just before exit there is no error.

oosiworx

Oh, you're right, I'm wrong: the embedding model really is missing 😄
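For readers wondering why the first run finished without an error and without data: as far as I understand it, an ingestion pipeline only hands nodes that already carry an embedding to the vector store, so a pipeline with no embedding step quietly writes nothing. The sketch below is a toy model of that behavior in plain Python, not LlamaIndex code. Every name in it (ToyNode, ToyVectorStore, split_sentences, toy_embed, run_toy_pipeline) is hypothetical and exists only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a text chunk; embedding is None until an embed step sets it."""
    text: str
    embedding: Optional[List[float]] = None


@dataclass
class ToyVectorStore:
    """Hypothetical stand-in for a vector store; the collection is created on the first add."""
    collection: Optional[List[ToyNode]] = None

    def add(self, nodes: List[ToyNode]) -> None:
        if self.collection is None:
            self.collection = []
        self.collection.extend(nodes)


def split_sentences(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy splitter: one node per sentence, split on '. '."""
    return [ToyNode(part) for n in nodes for part in n.text.split(". ") if part]


def toy_embed(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy embedding step: a fake one-dimensional vector based on text length."""
    for node in nodes:
        node.embedding = [float(len(node.text))]
    return nodes


def run_toy_pipeline(
    texts: List[str],
    transformations: List[Callable[[List[ToyNode]], List[ToyNode]]],
    store: ToyVectorStore,
) -> List[ToyNode]:
    """Apply each transformation in order, then store only nodes that have an embedding."""
    nodes = [ToyNode(text=t) for t in texts]
    for transform in transformations:
        nodes = transform(nodes)
    embedded = [n for n in nodes if n.embedding is not None]
    if embedded:  # nothing embedded means nothing is written and no error is raised
        store.add(embedded)
    return nodes
```

Running this with only split_sentences leaves store.collection as None, while adding toy_embed as the last transformation fills it. That mirrors what happened in this thread: the splitter alone produced nodes, but nothing reached the store.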

oosiworx

OK, I added the embedding model. The process now takes much more time, but there is still no collection in the store at the end. When I create an index with the code below, the index seems to be empty too when I look at it. Something is odd.

oosiworx

reader = SimpleDirectoryReader(
    input_dir=subdir,
    recursive=True,
)

# Exclude file_path from the metadata seen by the LLM and the embedding model;
# it is not needed for this use case.
# Also skip documents with empty text, as embedding them would result in an error.
docs = []
for doc in reader.iter_data():
    if len(doc) > 1:
        print('ok')
    doc[0].excluded_llm_metadata_keys.append("file_path")
    doc[0].excluded_embed_metadata_keys.append("file_path")
    if doc[0].text != '':
        docs.append(doc[0])

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=100, chunk_overlap=10),
        # TitleExtractor(),  # needs an LLM
        HuggingFaceEmbedding(model_name=embedding_models[model]['path']),
    ],
    vector_store=vector_store,
)

pipeline.run(documents=docs)

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, embed_model=embed_model, show_progress=True
)

oosiworx

When I then look at what's inside the index, I don't find any data, which makes me think I am still missing a point 🙂

WWhiteFang_Jr

Did you take a look at the vector store?

oosiworx

Yes, the collection does not show up. It should create a new collection.

oosiworx

OK, now the collection gets created. I did this, which in my world should not change the result at all, but it does:

result = pipeline.run(documents=docs)

oosiworx

I think it was even worse... the Qdrant dashboard does not do a real-time search; you have to reload the page to get the full list of collections. So it's all my fault, starting with forgetting to add the embedding model to the pipeline.
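One way to avoid being misled by a stale dashboard page is to ask the Qdrant server directly. The sketch below uses only the Python standard library to call Qdrant's REST endpoint GET /collections and list the collection names. It assumes a local instance on the default port 6333 with no API key; the function names collection_names and fetch_collection_names are made up for this example.

```python
import json
from typing import List
from urllib.request import urlopen


def collection_names(payload: dict) -> List[str]:
    """Extract collection names from the JSON body returned by GET /collections."""
    return [c["name"] for c in payload.get("result", {}).get("collections", [])]


def fetch_collection_names(base_url: str = "http://localhost:6333") -> List[str]:
    """Query a Qdrant server (assumed: local, default port, no API key) for its collections."""
    with urlopen(f"{base_url}/collections", timeout=5) as resp:
        return collection_names(json.load(resp))
```

Calling fetch_collection_names() right after pipeline.run(...) shows whether the collection exists on the server, independent of what the browser tab currently displays.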

WWhiteFang_Jr

No issue, always a learning moment 💪
