jalateras1963
Joined September 25, 2024
Hey all, I have a general question. I am looking at creating a RAG application using all our issues on Slack. Each issue usually has a stack trace and general error message, the script that was in error, and a bunch of other data. I was thinking of developing a Streamlit application to process each message in a stepwise fashion: it would display the message, and then I can add a resolution to it. I then want to store the issue and resolution in a vector database, but I'm not sure whether I should attach the resolution as metadata or just chunk the document with the issue and resolution before storing it in the database. Should I chunk at all, or store the stack trace as a single document? Would appreciate any insights.
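One of the two options described above (resolution as metadata, issue stored unchunked) can be sketched as follows. This is only an illustration of the data shape, not a recommendation from any library; `IssueRecord`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IssueRecord:
    """One Slack issue kept as a single, unchunked document (hypothetical shape)."""

    issue_id: str
    error_message: str
    stack_trace: str
    script: str
    resolution: str = ""

    def text_to_embed(self) -> str:
        # Only the problem description is embedded, so a new unresolved issue
        # is compared against past problems rather than past fixes.
        return f"{self.error_message}\n{self.stack_trace}"

    def metadata(self) -> dict:
        # The resolution travels as metadata and comes back with any match.
        return {
            "issue_id": self.issue_id,
            "script": self.script,
            "resolution": self.resolution,
        }


# Hypothetical sample issue, for illustration only.
record = IssueRecord(
    issue_id="issue-1",
    error_message="KeyError: 'device'",
    stack_trace='File "load_events.py", line 10, in main',
    script="load_events.py",
    resolution="Added the missing 'device' column to the input file.",
)
embedded_text = record.text_to_embed()
stored_metadata = record.metadata()
```

The trade-off in this layout is that similarity search only sees the error text, so a query phrased in terms of the fix would not match.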
4 comments
I have a very generic question and am wondering whether llama-index would help me here. I would like to integrate with Athena: read the schema of the specified table, craft the SQL query, execute it on Athena, and then return the results in something like a Streamlit application. From the user's perspective, I would like them to enter something like "From the event model, can you show me the number of minutes viewed by device between the dates 20230101 and 20230131 inclusive?" Is this something that I can do via llama-index, or should I be using another framework? My approach would be to read the event schema from Athena, pass it in as part of the context along with the user request, and then send it to OpenAI.
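The approach described above (schema plus user request in one prompt) can be sketched in plain Python, independent of any framework. The function name `build_sql_prompt`, the prompt wording, and the column names and types are assumptions made for illustration; the real schema would be read from Athena, and the resulting string would be sent to the model.

```python
def build_sql_prompt(table, columns, question):
    """Combine a table schema and a user request into one text-to-SQL prompt."""
    schema_lines = "\n".join(f"  {name} {dtype}" for name, dtype in columns)
    return (
        "You write SQL for Amazon Athena.\n"
        f"Table: {table}\n"
        f"Columns:\n{schema_lines}\n"
        f"Request: {question}\n"
        "Reply with a single SQL query and nothing else."
    )


# Hypothetical schema; in practice this would be read from Athena.
event_columns = [
    ("device", "string"),
    ("minutes_viewed", "double"),
    ("event_date", "string"),
]

prompt = build_sql_prompt(
    "event",
    event_columns,
    "Show the number of minutes viewed by device between 20230101 and 20230131 inclusive.",
)
```

Generated SQL should be treated as untrusted input: validating it (for example, allowing only read-only statements) before running it on Athena is a sensible precaution.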
1 comment
I am using an ingestion pipeline to create a bunch of embeddings for a PDF article and store them in ChromaDB. What is the best practice for updating these embeddings? Currently, when I process the same file twice, it inserts another lot of embeddings.
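The usual way to avoid duplicates is an upsert: key each document by a stable ID (such as its file path) and use a content hash only to detect changes. The sketch below shows that logic with an in-memory stand-in; `EmbeddingIndex` and `fake_embed` are hypothetical names, not part of LlamaIndex or Chroma. LlamaIndex's ingestion pipeline documents a comparable document-management mechanism when a docstore is attached, so it is worth checking the docs for your version.

```python
import hashlib


class EmbeddingIndex:
    """In-memory stand-in for a vector store keyed by a stable document ID."""

    def __init__(self):
        self._hashes = {}   # doc_id -> content hash of the stored version
        self._vectors = {}  # doc_id -> embedding of the stored version

    def upsert(self, doc_id, text, embed):
        """Embed and store text unless an identical version is already stored.

        Returns "skipped", "inserted", or "updated".
        """
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        previous = self._hashes.get(doc_id)
        if previous == content_hash:
            return "skipped"
        self._hashes[doc_id] = content_hash
        self._vectors[doc_id] = embed(text)
        return "inserted" if previous is None else "updated"

    def __len__(self):
        return len(self._vectors)


def fake_embed(text):
    """Hypothetical placeholder for a real embedding model."""
    return [float(len(text))]


index = EmbeddingIndex()
first = index.upsert("docs/article.pdf", "original text", fake_embed)
second = index.upsert("docs/article.pdf", "original text", fake_embed)
third = index.upsert("docs/article.pdf", "revised text", fake_embed)
```

Note that if the ID itself is derived from the content hash, an edited file gets a new ID and is inserted alongside the old version instead of replacing it.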
1 comment
How would I go about parsing the following response from an LLM? I tried telling it to leave out the markup tags.


Plain Text
```json
[
    {
        "question": "What problem does Bitcoin aim to solve in the context of online payments?"
    },
    {
        "question": "How does Bitcoin propose to prevent double-spending without a trusted third party?"
    },
    {
        "question": "What role does the proof-of-work chain play in the Bitcoin network?"
    },
    {
        "question": "Why is the longest chain in the Bitcoin network considered authoritative?"
    },
    {
        "question": "How does the Bitcoin network ensure its integrity against attackers?"
    },
    {
        "question": "What are the inherent weaknesses of the trust-based model in traditional electronic payments?"
    },
    {
        "question": "How does the requirement for mediation by financial institutions affect transaction costs and sizes?"
    },
    {
        "question": "What impact does the possibility of transaction reversal have on merchants and customers?"
    },
    {
        "question": "In what way does Bitcoin's peer-to-peer network maintain minimal structure?"
    },
    {
        "question": "How does the ability to make non-reversible payments benefit transactions for non-reversible services?"
    }
]

```
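A response like the one above can be parsed by stripping the optional Markdown fence before handing the payload to `json.loads`. This is a minimal sketch using only the standard library; the helper name `extract_questions` is hypothetical, and the fence marker is built programmatically only so it does not clash with this code block's own fence.

```python
import json
import re

# Matches an optional Markdown code fence (with or without a language tag)
# wrapped around the payload.
FENCE_RE = re.compile(r"^\s*`{3}[\w-]*\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)


def extract_questions(raw: str) -> list[str]:
    """Return the "question" strings from an LLM reply that may be fenced."""
    match = FENCE_RE.match(raw)
    payload = match.group(1) if match else raw
    items = json.loads(payload)
    return [item["question"] for item in items]


FENCE = "`" * 3
reply = "\n".join(
    [
        FENCE + "json",
        "[",
        '    {"question": "What problem does Bitcoin aim to solve in the context of online payments?"},',
        '    {"question": "What role does the proof-of-work chain play in the Bitcoin network?"}',
        "]",
        "",
        FENCE,
    ]
)

questions = extract_questions(reply)
```

The same function also accepts an unfenced JSON array, so it keeps working if the model later stops emitting the fence.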
1 comment
I am using the following code to ingest a document into a vector store

Plain Text
def process_document(dbdir):
    chroma_client = chromadb.PersistentClient(path=dbdir)
    chroma_collection = chroma_client.get_or_create_collection("bitcoin")
    vector_store = ChromaVectorStore(chroma_collection)

    llm = OpenAI(model="gpt-4-0125-preview")

    loader = PyMuPDFReader()
    docs = loader.load_data(file_path=os.path.join(os.path.dirname(__file__), "..", "docs", "bitcoin.pdf"))
    for doc in docs:
        doc.id_ = hashlib.sha256(doc.text.encode('utf-8')).hexdigest()
    click.echo(f"Loaded {len(docs)} documents")

    embed_model = OpenAIEmbedding()

    extractors = [
        SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model),
        TitleExtractor(nodes=5),
        SummaryExtractor(summaries=["prev", "self", "next"]),
        QuestionsAnsweredExtractor(questions=10, metadata_mode=MetadataMode.EMBED),  # keyword is metadata_mode, not metadata
        KeywordExtractor(keywords=5),
        embed_model
    ]

    pipeline = IngestionPipeline(transformations=extractors, vector_store=vector_store, cache=IngestionCache())
    processed_nodes = pipeline.run(documents=docs, show_progress=True, store_doc_text=True, store_doc_metadata=True)
    click.echo(f"Processed {len(processed_nodes)} nodes")


How would I use refresh_ref_docs so that when I run the same document again it doesn't create duplicate entries but instead updates the associated metadata and embeddings? I use the hash of the content as my doc_id, but whenever I add code that calls refresh I get the following error:

Plain Text
An error occurred: 'TextNode' object has no attribute 'get_doc_id'


Can I do a refresh as part of my ingestion pipeline?
6 comments
I am using llama-index to read my Slack channel. I then ask it to summarize the events from the support channel, and it returns them as a list of dot points. Is there a way to format the response in Markdown or similar?
2 comments
I am going through the tutorials and found that the following lines do not work with 0.7.12:

Plain Text
from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(chunk_size=1000)
2 comments