I am using the following code to ingest a document into a vector store:
def process_document(dbdir):
    """Ingest ``docs/bitcoin.pdf`` into a persistent Chroma vector store.

    Loads the PDF, assigns each document a deterministic ID derived from a
    SHA-256 hash of its text (so re-running the ingest produces the same
    ``ref_doc_id`` for unchanged content), then runs an ingestion pipeline
    that splits, enriches with metadata, embeds, and writes to Chroma.

    Args:
        dbdir: Filesystem path where the Chroma database is persisted.
    """
    chroma_client = chromadb.PersistentClient(path=dbdir)
    chroma_collection = chroma_client.get_or_create_collection("bitcoin")
    vector_store = ChromaVectorStore(chroma_collection)

    loader = PyMuPDFReader()
    docs = loader.load_data(
        file_path=os.path.join(
            os.path.dirname(__file__), "..", "docs", "bitcoin.pdf"
        )
    )
    # Content-addressed doc IDs: identical text always maps to the same ID,
    # which is the precondition for upsert/refresh-style deduplication.
    for doc in docs:
        doc.id_ = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
    click.echo(f"Loaded {len(docs)} documents")

    embed_model = OpenAIEmbedding()
    # Transformations run in order; the embed model last so final nodes
    # carry embeddings when they reach the vector store.
    extractors = [
        SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        ),
        TitleExtractor(nodes=5),
        SummaryExtractor(summaries=["prev", "self", "next"]),
        QuestionsAnsweredExtractor(questions=10, metadata=MetadataMode.EMBED),
        KeywordExtractor(keywords=5),
        embed_model,
    ]
    # NOTE(review): for dedup/update-on-rerun, IngestionPipeline supports an
    # attached docstore + docstore_strategy (e.g. UPSERTS) rather than
    # refresh_ref_docs, which expects Document objects (hence the
    # "'TextNode' object has no attribute 'get_doc_id'" error) — confirm
    # against the llama_index document-management docs.
    pipeline = IngestionPipeline(
        transformations=extractors,
        vector_store=vector_store,
        cache=IngestionCache(),
    )
    processed_nodes = pipeline.run(
        documents=docs,
        show_progress=True,
        store_doc_text=True,
        store_doc_metadata=True,
    )
    click.echo(f"Processed {len(processed_nodes)} nodes")
How would I use refresh_ref_docs so that when I run the same document again it doesn't create duplicate entries but instead updates the associated metadata and embeddings? I use a hash of the content to create my doc_id, but whenever I try to add code that calls refresh I get the following error:
An error occurred: 'TextNode' object has no attribute 'get_doc_id'
Can I do a refresh as part of my ingestion pipeline?