I'm running an ingestion pipeline on a

At a glance

I'm running an ingestion pipeline on a document, but it's not producing any nodes. I check the document and it seems to have proper content and be set properly, any reason why this might be the case?

38 comments

LLogan M

can you give some sample code?

mmaybe goats dont exist

Yes I'll grab it in just a few ty

mmaybe goats dont exist

@Logan M

Plain Text

# Create ingestion pipeline
def create_ingestion_pipeline() -> IngestionPipeline:
    # Node parser
    node_parser = SentenceSplitter(
        separator=" ",
        chunk_size=1024,
        chunk_overlap=200,
    )

    worker_llm = AzureOpenAI(
        temperature=0.1,
        model="gpt-3.5-turbo",
        max_tokens=512,
        engine=AZURE_OPENAI_API_DEPLOYMENT_NAME_GPT_4,
        azure_endpoint=AZURE_OPENAI_API_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )

    title_extractor = TitleExtractor(llm=worker_llm, num_workers=4)

    pipeline = IngestionPipeline(
        transformations=[
            node_parser,
            title_extractor,
        ]
    )
    return pipeline

def create_document_from_s3(s3_url: str) -> List[Document]:
    """Creates a document from an S3 URL."""
    # FIXME: will need to handle cruft
    # we have to download locally based on available research of data loading in llama-index
    file_path = download_file(s3_url)
    loader = UnstructuredReader()
    document = loader.load_data(file_path)
    delete_local_file(file_path)
    return document

main functionality

Plain Text

pipeline = create_ingestion_pipeline()
documents = create_document_from_s3(s3_url)
nodes = pipeline.run(document=documents)
vector_store = get_pinecone_vector_store()
vector_store.add(nodes=nodes)

value of documents when printed

Plain Text

[Document(id_='6c1580c3-7dc2-47b5-9909-9b0a60a8dfa2', embedding=None, metadata={'filename': '/tmp/llama-indexer/files/91301739-786a-45b3-b387-f5a7a7c35992-1689966736444-864818-sammyisagamer2.txt', 'user_id': '1'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='b10a29d59e236cfb927b37b012eeb2bf79e97898e16f7d4243947225baeb0bdd', text='Sammy roberts is a gamer', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]

LLogan M

nodes = pipeline.run(document=documents)

And nodes is empty?

mmaybe goats dont exist

yessir

mmaybe goats dont exist

also a secondary question is, is this a normal way to add nodes to an existing pinecone:

Plain Text

vector_store.add(nodes=nodes)

mmaybe goats dont exist

this is the output when i print nodes and print the results of the above vector store add

Plain Text

[]
Upserted vectors: 0it [00:00, ?it/s]
[]

LLogan M

Normally I would attach the vector store to the pipeline, it will insert for you

Plain Text

    pipeline = IngestionPipeline(
        transformations=[
            node_parser,
            title_extractor,
        ],
        vector_store=vector_store
    )

LLogan M

And I'm assuming that document print was before running the pipeline?

LLogan M

I'm low-key stumped. I would almost use a debugger at this point lol

mmaybe goats dont exist

okay, that makes sense, essentially I just want to be able to add user uploaded files to pinecone dynamically

mmaybe goats dont exist

yeah that makes sense ill jump into it with a debugger and see if its just the file potentially

LLogan M

Yea, like the same code works for me (minus the azure LLM).

you can narrow it down a bit too, and run each step

Plain Text

documents = ...
nodes = node_parser(documents)
nodes = title_extractor(nodes)

LLogan M

above is what the pipeline does lol

mmaybe goats dont exist

should I add the embedding model to the transformations?

mmaybe goats dont exist

i sometimes see it in transformation

mmaybe goats dont exist

okay ill try that as well

mmaybe goats dont exist

node_parser.get_nodes_from_documents(documents=documents) works lol

LLogan M

ah yes, definitely add embeddings to the pipeline

LLogan M

missed that 😅

mmaybe goats dont exist

does order matter for that?

mmaybe goats dont exist

I Do have it on the service context:

Plain Text

VectorStoreIndex.from_vector_store(
        service_context=get_service_context(),
        vector_store=vector_store,  # type: ignore
        storage_context=get_storage_context(),
    )

def get_service_context():
    llm = get_llm()
    embed_model = get_embed_model()
    service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)  # pyright: ignore[reportUnknownMemberType]
    return service_context

LLogan M

Order does matter 👀 Since you probably want to embed the nodes after splitting, and after attaching the new metadata

LLogan M

its in the service context, but the ingestion pipeline just runs what its given. And vector_store.add(nodes) requires the nodes to have embeddings attached

mmaybe goats dont exist

Okay so:

Plain Text

def create_ingestion_pipeline() -> IngestionPipeline:
    # Node parser
    node_parser = SentenceSplitter()

    worker_llm = AzureOpenAI(
        temperature=0.1,
        model="gpt-3.5-turbo",
        max_tokens=512,
        engine=AZURE_OPENAI_API_DEPLOYMENT_NAME_GPT_4,
        azure_endpoint=AZURE_OPENAI_API_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )

    title_extractor = TitleExtractor(llm=worker_llm)

    pipeline = IngestionPipeline(
        transformations=[node_parser, title_extractor, AzureOpenAIEmbedding()],
        vector_store=get_pinecone_vector_store(),
    )

    return pipeline

mmaybe goats dont exist

pipeline does not work, but doing the node parser alone on its own does work, not sure how to use title extractor alone

mmaybe goats dont exist

its quite annoying, i even tried defining the pipeline outside of the function in case some weird pass by ref thing was happening

LLogan M

oh the code I gave before is exactly what I would run

LLogan M

Plain Text

documents = ...
nodes = node_parser(documents)
nodes = title_extractor(nodes)
nodes = AzureOpenAIEmbedding(<azure kwargs>)(nodes)

LLogan M

each transformation has the __call__() method implemented, so you can chain them like that

mmaybe goats dont exist

hmm that was erroring for me

LLogan M

what kind of error?

mmaybe goats dont exist

nvm pylance was yelling at me, but it seems pylance sucks

mmaybe goats dont exist

it seems to be running now

LLogan M

🙏

LLogan M

So I guess with that, I hope that the problematic piece can be found 😅 if all three work, then I'm super lost

mmaybe goats dont exist

Yeah, its quite strange, just gotta figure out why pinecone randomly decided not to connect and hopefully this works and i can get off the computer

mmaybe goats dont exist

thank you

Add a reply

Find answers from the community

I'm running an ingestion pipeline on a