I'm running an ingestion pipeline on a document, but it's not producing any nodes. I've checked the document and it seems to have proper content and to be set up properly. Any reason why this might be the case?
can you give some sample code?
Yes I'll grab it in just a few ty
@Logan M
Plain Text
# imports, assuming llama-index 0.9.x (the ServiceContext era this thread uses)
from typing import List

from llama_index import Document, download_loader
from llama_index.extractors import TitleExtractor
from llama_index.ingestion import IngestionPipeline
from llama_index.llms import AzureOpenAI
from llama_index.node_parser import SentenceSplitter

# UnstructuredReader is a llama-hub loader
UnstructuredReader = download_loader("UnstructuredReader")

# Create ingestion pipeline
def create_ingestion_pipeline() -> IngestionPipeline:
    # Node parser
    node_parser = SentenceSplitter(
        separator=" ",
        chunk_size=1024,
        chunk_overlap=200,
    )

    worker_llm = AzureOpenAI(
        temperature=0.1,
        model="gpt-3.5-turbo",
        max_tokens=512,
        engine=AZURE_OPENAI_API_DEPLOYMENT_NAME_GPT_4,
        azure_endpoint=AZURE_OPENAI_API_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )

    title_extractor = TitleExtractor(llm=worker_llm, num_workers=4)

    pipeline = IngestionPipeline(
        transformations=[
            node_parser,
            title_extractor,
        ]
    )
    return pipeline

def create_document_from_s3(s3_url: str) -> List[Document]:
    """Creates documents from an S3 URL."""
    # FIXME: will need to handle cruft
    # we have to download locally based on available research of data loading in llama-index
    file_path = download_file(s3_url)
    loader = UnstructuredReader()
    documents = loader.load_data(file_path)  # load_data returns a list of Documents
    delete_local_file(file_path)
    return documents

Main functionality:
Plain Text
pipeline = create_ingestion_pipeline()
documents = create_document_from_s3(s3_url)
nodes = pipeline.run(document=documents)
vector_store = get_pinecone_vector_store()
vector_store.add(nodes=nodes)
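One thing worth double-checking, assuming the standard IngestionPipeline API: run() takes a documents= keyword (plural), and unrecognized keyword arguments are silently ignored, which would yield an empty node list exactly like the output further down.
Plain Text
# sketch: note the plural keyword
nodes = pipeline.run(documents=documents)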

value of documents when printed
Plain Text
[Document(id_='6c1580c3-7dc2-47b5-9909-9b0a60a8dfa2', embedding=None, metadata={'filename': '/tmp/llama-indexer/files/91301739-786a-45b3-b387-f5a7a7c35992-1689966736444-864818-sammyisagamer2.txt', 'user_id': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='b10a29d59e236cfb927b37b012eeb2bf79e97898e16f7d4243947225baeb0bdd', text='Sammy roberts is a gamer', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
nodes = pipeline.run(document=documents)

And nodes is empty?
Also, a secondary question: is this a normal way to add nodes to an existing Pinecone index?
Plain Text
vector_store.add(nodes=nodes)
This is the output when I print nodes and print the result of the vector store add above:
Plain Text
[]
Upserted vectors: 0it [00:00, ?it/s]
[]
Normally I would attach the vector store to the pipeline, it will insert for you

Plain Text
    pipeline = IngestionPipeline(
        transformations=[
            node_parser,
            title_extractor,
        ],
        vector_store=vector_store
    )
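With the store attached, run() does the insert for you; a minimal sketch, assuming the pipeline also includes an embedding transformation so the nodes arrive with embeddings:
Plain Text
nodes = pipeline.run(documents=documents)
# the pipeline embeds the nodes and upserts them into the attached vector store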
And I'm assuming that document print was before running the pipeline?
I'm low-key stumped. I would almost use a debugger at this point lol
Okay, that makes sense. Essentially I just want to be able to add user-uploaded files to Pinecone dynamically.
Yeah, that makes sense. I'll jump into it with a debugger and see if it's just the file, potentially.
Yea, like the same code works for me (minus the Azure LLM).

you can narrow it down a bit too, and run each step

Plain Text
documents = ...
nodes = node_parser(documents)
nodes = title_extractor(nodes)
above is what the pipeline does lol
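Running the steps one at a time shows exactly where things break; a quick check, assuming the same objects as above (TitleExtractor writes a document_title key into node metadata):
Plain Text
nodes = node_parser(documents)
print(len(nodes))         # should be >= 1 for the sample document
nodes = title_extractor(nodes)
print(nodes[0].metadata)  # should now include "document_title"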
should I add the embedding model to the transformations?
I sometimes see it in the transformations
Okay, I'll try that as well
node_parser.get_nodes_from_documents(documents=documents) works lol
ah yes, definitely add embeddings to the pipeline
missed that πŸ˜…
does order matter for that?
I do have it in the service context:
Plain Text
VectorStoreIndex.from_vector_store(
        service_context=get_service_context(),
        vector_store=vector_store,  # type: ignore
        storage_context=get_storage_context(),
    )

def get_service_context():
    llm = get_llm()
    embed_model = get_embed_model()
    service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)  # pyright: ignore[reportUnknownMemberType]
    return service_context
Order does matter πŸ‘€ Since you probably want to embed the nodes after splitting, and after attaching the new metadata
It's in the service context, but the ingestion pipeline just runs what it's given. And vector_store.add(nodes) requires the nodes to have embeddings attached.
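For reference, a minimal sketch of attaching embeddings before calling add(), assuming an already-configured embed_model; embedding models implement the same transformation interface as the other steps:
Plain Text
nodes = embed_model(nodes)  # fills in node.embedding on each node
vector_store.add(nodes=nodes)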
Okay so:
Plain Text
def create_ingestion_pipeline() -> IngestionPipeline:
    # Node parser
    node_parser = SentenceSplitter()

    worker_llm = AzureOpenAI(
        temperature=0.1,
        model="gpt-3.5-turbo",
        max_tokens=512,
        engine=AZURE_OPENAI_API_DEPLOYMENT_NAME_GPT_4,
        azure_endpoint=AZURE_OPENAI_API_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )

    title_extractor = TitleExtractor(llm=worker_llm)

    pipeline = IngestionPipeline(
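        # note: AzureOpenAIEmbedding() here likely needs the same Azure kwargs as the LLM above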
        transformations=[node_parser, title_extractor, AzureOpenAIEmbedding()],
        vector_store=get_pinecone_vector_store(),
    )

    return pipeline
The pipeline does not work, but running the node parser alone does work. Not sure how to use the title extractor alone.
It's quite annoying; I even tried defining the pipeline outside of the function in case some weird pass-by-reference thing was happening.
oh the code I gave before is exactly what I would run
Plain Text
documents = ...
nodes = node_parser(documents)
nodes = title_extractor(nodes)
nodes = AzureOpenAIEmbedding(<azure kwargs>)(nodes)
each transformation has the __call__() method implemented, so you can chain them like that
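In other words, a pipeline run is roughly a fold over its transformations; a conceptual sketch, not the library's exact internals:
Plain Text
nodes = documents
for transform in pipeline.transformations:
    nodes = transform(nodes)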
hmm that was erroring for me
what kind of error?
nvm, Pylance was yelling at me, but it seems Pylance sucks
it seems to be running now
So I guess with that, I hope that the problematic piece can be found πŸ˜… if all three work, then I'm super lost
Yeah, it's quite strange. Just gotta figure out why Pinecone randomly decided not to connect; hopefully this works and I can get off the computer.