Find answers from the community

Lotar
Offline, last seen 4 months ago
Joined September 25, 2024
I am trying to implement a recommendation from https://gpt-index.readthedocs.io/en/latest/end_to_end_tutorials/dev_practices/production_rag.html to "decouple chunks used for retrieval from chunks used for synthesis" by

1.) generating a summary for each node
2.) storing an embedding of a summary along with the original text corresponding to the summary
3.) using the summary embedding during the retrieval step
4.) using the original text during the synthesis step

I was hoping that using DocumentSummaryIndex, as recommended in the above-linked documentation, would be the simplest way to do that; however, I noticed that this index persists the summaries and their embeddings in the DocStore, not the VectorStore (in my case, MongoDB). I am wondering about the performance of this in production scenarios (with tens of thousands of chunks). I'd like to find a solution with llama-index where the embeddings are stored in a vector database.

So far, what I've come up with is to use a plain old VectorStoreIndex with a custom subclass of BaseEmbedding, in which I would call the LLM to generate a summary of each node and store the embedding of the summary instead of the embedding of the node itself. This feels hacky to me; can anybody think of a better approach? Ideally, I am looking for something that would also preserve the summaries in their textual form, not only as embeddings.
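
A minimal sketch of the idea without subclassing BaseEmbedding, assuming the installed llama-index version uses a pre-set node.embedding instead of re-embedding the node text (worth verifying for your version), and with summarize, embed_model, nodes and storage_context as placeholders for whatever is already in scope:

Plain Text
# sketch: retrieve by summary embedding, synthesize over the original node text
# (attribute names like extra_info vs. metadata differ between llama-index versions)
from llama_index import VectorStoreIndex

def attach_summary_embeddings(nodes, embed_model, summarize):
    # summarize() is a hypothetical helper: any LLM call that returns a short summary string
    for node in nodes:
        summary = summarize(node.get_text())
        node.extra_info = {**(node.extra_info or {}), "summary": summary}  # keep the summary as text too
        node.embedding = embed_model.get_text_embedding(summary)           # retrieval uses this embedding
    return nodes

# assumes this version of VectorStoreIndex honors a pre-set node.embedding
index = VectorStoreIndex(
    nodes=attach_summary_embeddings(nodes, embed_model, summarize),
    storage_context=storage_context)

Because the summary is also kept in extra_info, it survives in textual form alongside the original text that is used at synthesis time.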
5 comments
Has something recently changed with MilvusVectorStore? When inserting new documents created as Document(text=segment_text, doc_id=doc_id, extra_info=extra_info), only the text field is stored in the text column in Milvus, without any of the extra_info; previously, the extra_info appeared at the beginning of each text.
6 comments
Many of the more advanced RAG techniques (e.g. those outlined in https://gpt-index.readthedocs.io/en/latest/end_to_end_tutorials/dev_practices/production_rag.html) rely on generating summaries of the data that is being embedded. What do you use to generate these summaries? I find that GPT-4 is not usable with larger datasets because of its slow performance and low rate limits, and GPT-3.5 sometimes does not generate good enough summaries. Are there any alternatives?
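
For what it's worth, a minimal sketch of batching gpt-3.5-turbo summarization with a concurrency cap; the prompt, the cap of 8, and the pre-1.0 openai client are illustrative assumptions, not anything from this thread:

Plain Text
# sketch: summarize many chunks concurrently while capping in-flight requests
import asyncio
import openai  # pre-1.0 style client, matching the ChatOpenAI-era code elsewhere on this page

async def summarize_all(chunks, max_in_flight=8):
    sem = asyncio.Semaphore(max_in_flight)  # illustrative cap to stay under rate limits

    async def summarize(text):
        async with sem:
            resp = await openai.ChatCompletion.acreate(
                model="gpt-3.5-turbo",
                temperature=0,
                messages=[{"role": "user",
                           "content": f"Summarize the following passage in 2-3 sentences:\n\n{text}"}],
            )
            return resp["choices"][0]["message"]["content"]

    return await asyncio.gather(*(summarize(c) for c in chunks))

# summaries = asyncio.run(summarize_all(chunks))

Parallelizing the calls mostly addresses throughput and rate-limit pressure; it does not by itself make the summaries any better.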
1 comment
W
What resources (books, article series, online courses, anything else) would you recommend for an overview of strategies for building a RAG system? I am looking for tips on how to decide on the best way to index content, which querying strategy to choose, how to incorporate user feedback, etc. - all the necessary parts of a working RAG system. I am in the process of building a RAG system of my own, but it very often feels like I am reinventing the wheel (even when using llama-index, which has been a tremendous help), and I am curious what folks here have been using to get ahead. Any tips will be very much appreciated.
7 comments
Lotar

Streaming

I am running into problems when trying to work with a streamed query response. Everything works correctly when using the following code and the Flask development server:

Plain Text
response = query_engine.query(question)
full_answer = ""
for token in response.response_gen:
    full_answer = full_answer + token
    emit("answer", {"token": token})


However, when using gunicorn with an eventlet or gevent worker (I need to use one of those because I want to use websockets to stream the response to the client), the code hangs at the for loop line and no iteration of the loop is executed. I assume the code needs to be written differently to work with the gunicorn workers; does anybody have any experience with this?
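
One possible workaround, sketched under the assumption that Flask-SocketIO drives the websocket and that app, query_engine, and the event names here are placeholders: move the blocking token loop into a background task so the eventlet/gevent greenlet handling the event is not blocked.

Plain Text
# sketch: stream tokens from a background task instead of inside the event handler
from flask import request
from flask_socketio import SocketIO

socketio = SocketIO(app, async_mode="eventlet")  # or "gevent", matching the gunicorn worker class

def stream_answer(question, sid):
    response = query_engine.query(question)
    full_answer = ""
    for token in response.response_gen:
        full_answer += token
        socketio.emit("answer", {"token": token}, room=sid)  # emitting outside a request context goes through socketio
        socketio.sleep(0)  # yield so each token is flushed to the client right away
    socketio.emit("answer_done", {"answer": full_answer}, room=sid)

@socketio.on("question")
def handle_question(data):
    # do not iterate the generator in the handler; hand it off to a background task
    socketio.start_background_task(stream_answer, data["question"], request.sid)

If the loop still never advances, the streaming HTTP call itself may not be cooperating with the eventlet/gevent monkey patching, in which case running it through eventlet.tpool or a real thread is another option to try.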
8 comments
When using a query engine with streaming=True and examining extra_info on response.source_nodes, all the values are None, although the text of the nodes contains the extra_info at the top. What's the easiest way to retrieve the extra_info of the nodes that were used to generate a response?

I am loading the index from storage like this:
Plain Text
load_index_from_storage(
    storage_context=self._get_storage_context(
        collection_name=self.id,
        persist_dir=os.path.join(self.base_persist_dir, self.id)),
    service_context=self._get_service_context())

and these are the supporting methods:

Plain Text
def get_query_engine(self, prompt_template=None, refine_template=None) -> BaseQueryEngine:
    return RetrieverQueryEngine(
        retriever=VectorIndexRetriever(
            index=self.index,
            similarity_top_k=math.floor((4097 - 300) / self.max_chunk_size)),
        response_synthesizer=ResponseSynthesizer.from_args(
            response_mode=ResponseMode.COMPACT,
            service_context=self._get_service_context(),
            streaming=self.streaming,
            text_qa_template=QuestionAnswerPrompt(prompt_template),
            refine_template=RefinePrompt(refine_template)))

def _get_storage_context(self, collection_name, persist_dir=None):
    return StorageContext.from_defaults(
        vector_store=MilvusVectorStore(
            collection_name=collection_name, overwrite=self.overwrite),
        docstore=MongoDocumentStore.from_uri(...),
        persist_dir=persist_dir)

def _get_service_context(self):
    llm_predictor = LLMPredictor(llm=ChatOpenAI(
        temperature=0, model="gpt-3.5-turbo", streaming=self.streaming))
    prompt_helper = PromptHelper()
    service_context = ServiceContext.from_defaults(
        chunk_size=self.max_chunk_size,
        llm_predictor=llm_predictor,
        prompt_helper=prompt_helper)
    return service_context
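
For reference, a small sketch of how the metadata is usually read back from the source nodes (attribute names vary a little between llama-index versions); the values will stay None unless the vector store actually round-trips extra_info:

Plain Text
# sketch: inspect metadata on the nodes that backed the answer
response = query_engine.query(question)
for node_with_score in response.source_nodes:
    node = node_with_score.node
    print(node.extra_info)        # stays None if the vector store did not persist it
    print(node.get_text()[:200])  # the extra_info prepended to the text is still visible here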
12 comments
Is there a way to set the maximum number of refinement rounds with response_mode = "compact"? Because of the nature of my data, I need to go with similarity_top_k = 8, and sometimes that results in too many OpenAI requests; I would like to limit it.
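
If there is no direct refine-round setting in your version, one workaround to sketch (not a hard cap) is to filter the retrieved nodes with a similarity cutoff before synthesis, so fewer chunks reach the compact/refine step; the 0.78 threshold is illustrative, and depending on the version node_postprocessors may be accepted by the query engine or need to be attached elsewhere:

Plain Text
# sketch: drop low-scoring nodes before synthesis so fewer chunks reach the refine step
from llama_index.indices.postprocessor import SimilarityPostprocessor

query_engine = RetrieverQueryEngine(
    retriever=VectorIndexRetriever(index=index, similarity_top_k=8),
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.78)],  # illustrative threshold
    response_synthesizer=response_synthesizer)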
5 comments
When using GPTKnowledgeGraphIndex with MongoIndexStore and include_embeddings=True, I am running into an error with Mongo:

pymongo.errors.DocumentTooLarge: 'update' command document too large

I guess the reason for that is that all the embeddings are stored in one Mongo document. My use case is that I am experimenting with knowledge graphs, have a couple of hundred pages of text, and need to store the triplets and their embeddings somehow.

Are there any plans to make the storage more scalable, or is there a better way to achieve this and am I going at it from a totally wrong angle? Thank you!
3 comments