Find answers from the community

MitchMcD
Joined September 25, 2024
I keep getting an error when trying to use the new, cheaper gpt-4o-2024-08-06 model for generating Q&A; i did run the upgrade.
https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/
thank you
2 comments
I am evaluating various embedding models. What's the best way to modify this guide and change embedding models only while keeping all other variables constant? https://docs.llamaindex.ai/en/stable/examples/evaluation/retrieval/retriever_eval/
thank you
19 comments
can I extract page numbers and other metadata with LlamaParse?
6 comments
what is the privacy policy of LlamaParse? where is the data stored? who gets access? etc. thx
Helping some clients parse complicated PDF documents.
18 comments
Plain Text
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

llm = OpenAI(model="gpt-4-0125-preview", temperature=0.1)

Settings.callback_manager = callback_manager


Plain Text
ImportError: cannot import name 'OpenAIFineTuningHandler' from 'llama_index.core.callbacks' (/usr/local/lib/python3.10/dist-packages/llama_index/core/callbacks/__init__.py)
5 comments
i've been trying to read PDFs, DOCX files, and directories, but am still getting this error:
Plain Text
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("/Users/home/Library/Mobile Documents/com~apple~CloudDocs/Academia/Legal Research").load_data()
index = VectorStoreIndex.from_documents(documents)

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/Users/home/Downloads/OpenAI_Finetuning_Distill_GPT_4_to_GPT_3_5_(v2).ipynb Cell 13 line 1
----> 1 from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
      3 documents = SimpleDirectoryReader("/Users/home/Library/Mobile Documents/com~apple~CloudDocs/Academia/Legal Research").load_data()
      4 index = VectorStoreIndex.from_documents(documents)

File ~/.ooba/text-generation-ui/installer_files/env/lib/python3.10/site-packages/llama_index/__init__.py:21
     17 from llama_index.embeddings import OpenAIEmbedding
     19 # indices
     20 # loading
---> 21 from llama_index.indices import (
     22     ComposableGraph,
     23     DocumentSummaryIndex,
     24     GPTDocumentSummaryIndex,
     25     GPTKeywordTableIndex,
     26     GPTKnowledgeGraphIndex,
     27     GPTListIndex,
     28     GPTRAKEKeywordTableIndex,
     29     GPTSimpleKeywordTableIndex,
     30     GPTTreeIndex,
     31     GPTVectorStoreIndex,
     32     KeywordTableIndex,
     33     KnowledgeGraphIndex,
     34     ListIndex,
...
File ~/.ooba/text-generation-ui/installer_files/env/lib/python3.10/site-packages/pydantic/main.py:341, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for DataSource

i've installed all the required packages, i think, with
Plain Text
pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
19 comments
given the changes to ServiceContext, can i still use this notebook for fine-tuning?
https://colab.research.google.com/drive/1NgyCJVyrC2xcZ5lxt2frTU862v6eJHlc?usp=sharing#scrollTo=J9vTWspqPwYY
4 comments
question for this esteemed community: when we split data from a CSV into chunks, we embed the chunks and do a vector search, but we can't see the full original text prior to splitting (unless you link it to a database that stores the full original texts).
what if we embed the metadata too? presumably, the metadata for chunks coming from the same source text should be nearly identical, and the relevance score between those metadata vectors will be almost 1. if you need to find all chunks from the same original text, and thus show the full text rather than excerpts, you just run another vector search against this one chunk. what do you think?
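A minimal sketch of that idea, with made-up toy vectors standing in for real metadata embeddings (cosine similarity is what most vector stores score with):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def chunks_from_same_source(query_vec, metadata_vecs, threshold=0.95):
    # indices of chunks whose metadata vector is nearly identical
    # to the query chunk's metadata vector
    return [i for i, v in enumerate(metadata_vecs)
            if cosine(query_vec, v) >= threshold]

# toy metadata embeddings: chunks 0 and 1 come from the same text,
# chunk 2 from a different one
vecs = [[1.0, 0.0, 0.1], [0.99, 0.01, 0.12], [0.0, 1.0, 0.0]]
same = chunks_from_same_source(vecs[0], vecs)
```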
7 comments
Query

I have a table hosted on Supabase with id, content, metadata, and embeddings columns; i did not use LlamaIndex or LangChain to create it.
is it possible to run an index over it and start querying using llama-index? thanks!
4 comments
hello, for this tutorial, how do i call my fine-tuned model, not the base one? thx https://gpt-index.readthedocs.io/en/v0.8.45/examples/llm/gradient_base_model.html
3 comments
hey llama team. i think i saw somewhere that you open-sourced this interface, but i can't find it. can you pls help me locate it? thanks much!
8 comments
This is probably such a simple question and the answer is probably written somewhere on the Docs page, but I could not find it. How do I preserve the PDF page number for a long PDF, so that when getting vector search (or any other) results, it shows an excerpt + a page number? Thank you
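For what it's worth, llama-index's stock PDF reader records a page label in each document's metadata, I believe; as a library-agnostic sketch of the underlying idea (all names here are made up):

```python
def chunk_pages(pages, chunk_size=200):
    # split per-page text into chunks, tagging each chunk with
    # the page it came from (pages is a list of page strings, page 1 first)
    chunks = []
    for page_num, text in enumerate(pages, start=1):
        for start in range(0, len(text), chunk_size):
            chunks.append({
                "text": text[start:start + chunk_size],
                "metadata": {"page_number": page_num},
            })
    return chunks

docs = chunk_pages(["first page text", "second page text"], chunk_size=10)
```

Because the page number rides along in each chunk's metadata, any retrieved excerpt can display it next to the text.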
5 comments
hi all, if i am embedding a book, is there a way to make each chapter a node, so that when i ask "summarize chapter x" it works through this node only? would appreciate guidance. thank you!
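A hedged sketch of one way to do the chapter-per-node split before indexing, assuming chapters start with "Chapter N" headings (the regex and names are illustrative; with chapter stored as node metadata, a query can be filtered to one chapter):

```python
import re

def split_into_chapters(book_text):
    # split on 'Chapter N' headings at the start of a line;
    # re.split keeps the captured headings at odd indices
    parts = re.split(r"(?m)^(Chapter \d+)", book_text)
    nodes = []
    for i in range(1, len(parts), 2):
        nodes.append({
            "text": parts[i + 1].strip(),
            "metadata": {"chapter": parts[i]},
        })
    return nodes

book = "Chapter 1\nBikes are fun.\nChapter 2\nCars are fast.\n"
nodes = split_into_chapters(book)
```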
11 comments
do i get it right that we can now upload whole books with LLaVA without worrying about chunking? storage-wise, are images converted into vectors of similar dimensions, or will they require more space?
4 comments
why doesn't the pandas index read the headers of the df first? e.g. one of my columns is 'Ship State', but llama keeps using just 'State' in its queries
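One workaround people use is to snap the model's guessed column name to the closest real header before running the generated pandas code; a stdlib sketch (the column list is hypothetical):

```python
import difflib

# hypothetical headers; the model guessed 'State' instead of 'Ship State'
columns = ["Order ID", "Ship State", "Ship City", "Amount"]

def nearest_column(guess, columns):
    # map a guessed column name to the closest real header
    matches = difflib.get_close_matches(guess, columns, n=1, cutoff=0.5)
    return matches[0] if matches else None

fixed = nearest_column("State", columns)
```

Another common fix is simply to list the exact `df.columns` in the prompt so the model never has to guess.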
24 comments
I have just completed the fine-tuning of 3.5-turbo ("Fine-tuning to Memorize Knowledge"). Here are the results if anyone is weighing fine-tuning vs RAGing:
  • first off, shout-out to the Llama-Index team for putting in place this well-thought-out and very comprehensive guide and eval framework, and for helping me resolve some of the issues. There are some minor hiccups in the code, but you can resolve those easily if they get flagged on your machine;
  • I fine-tuned on a legal textbook, "Legal Research, Analysis & Writing". It is a very foundational treatise if you want to become a lawyer;
  • ~1800 questions/answers, split 70/30 into train/val;
  • did two iterations: first, fine-tuned using the train/val datasets, then fine-tuned the already fine-tuned model again using the complete dataset for training only;
  • you will see in the code that the ground truth (gt) comes from gpt-4, while the base is gpt-3.5. while i am a bit hesitant to compare apples to oranges, for the main question i was after (is the RAG framework a solid approach), it worked;
  • objective results:
'ft_rag_score': 0.775,
'ft_score': 0.725,
'rag_score': 0.825,
'base_score': 0.675
as you can see, 'rag only' wins;
  • the temperature was 0 for all models;
  • in terms of legal style, vocabulary, and coherence, gpt-3.5-turbo is already quite good on the 'how' part (it can explain how to write a legal memo, how to research a case, etc.), but it still hallucinates a lot on the 'what' part (what is case x about, when was statute y enacted, etc.);
  • i was surprised to see that ft_rag is a little worse than rag only, but it is good to know that grounding models in existing knowledge works great 🙂;
  • happy to help or answer any questions. thanks again, LlamaIndex team, for doing what you are doing.
1 comment
Is it possible to extract information from the 'Pandas Instructions' to show a corresponding graph in plt?
2 comments
can i manually correct the pandas formula for the pandas index? it keeps giving me the wrong response. e.g. i ask for the sales on the busiest day, and it keeps giving me the sales for the whole year
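The intended computation, written out directly; a stdlib sketch with made-up data (in pandas it would be roughly `df.groupby('Date')['Sales'].sum()` followed by `.idxmax()`, which can be run by hand to check the generated formula):

```python
from collections import defaultdict

# hypothetical per-transaction rows: (date, amount)
sales = [("2024-03-01", 100), ("2024-03-01", 250),
         ("2024-03-02", 80), ("2024-03-03", 300)]

daily = defaultdict(int)
for day, amount in sales:
    daily[day] += amount          # total sales per day

busiest_day = max(daily, key=daily.get)
busiest_total = daily[busiest_day]
```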
3 comments
Plain Text
index.query("What is the document about?")

Plain Text
'VectorStoreIndex' object has no attribute 'query'

@Logan M what's wrong with the query command? thanks!
2 comments
should i use another encoding?

Plain Text
from llama_index.readers.file import FlatReader
from pathlib import Path

reader = FlatReader()
docs_2021 = reader.load_data(Path("my_file.pdf"))


Plain Text
File ~/.local/lib/python3.10/site-packages/llama_index/readers/file/flat/base.py:28, in FlatReader.load_data(self, file, extra_info)
     26 """Parse file into string."""
     27 with open(file, encoding="utf-8") as f:
---> 28     content = f.read()
     29 metadata = {"filename": file.name, "extension": file.suffix}
     30 if extra_info:

File /usr/lib/python3.10/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
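FlatReader opens the file as UTF-8 text, so a binary PDF (or any non-UTF-8 file) will raise exactly this error; a PDF really needs a proper PDF parser. For plain-text files of unknown encoding, a fallback loop like this sketch can help (the helper name is made up):

```python
import os
import tempfile

def read_text_any_encoding(path, encodings=("utf-8", "latin-1")):
    # try each encoding in turn; as a last resort, replace bad bytes
    raw = open(path, "rb").read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

# demo: a latin-1 file that a strict utf-8 read would crash on
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
tmp.write("café".encode("latin-1"))
tmp.close()
text = read_text_any_encoding(tmp.name)
os.unlink(tmp.name)
```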
1 comment
This is probably a very stupid question, my apologies in advance.

If I follow any of these two guides:
https://docs.llamaindex.ai/en/stable/examples/usecases/10q_sub_question.html
https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
And I use, say, Anthropic for the LLM and OpenAI for embeddings. Is my data accessible only to OpenAI and Anthropic, or does it get shared with Llama-Index too?
13 comments
i am going through this tutorial (thank you!). i loaded a CSV file, but when trying to create the index with
Plain Text
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

i keep getting this error: AttributeError: 'Document' object has no attribute 'get_doc_id'
any advice would be much appreciated! thank you!
https://docs.llamaindex.ai/en/stable/examples/vector_stores/Timescalevector.html#load-documents-and-metadata-into-timescalevector-vectorstore
1 comment
i am trying to generate a dataset for fine-tuning on HF. is it possible to also extract the 'context' when running this code? which i presume would be the top chunk from a vector search if i run a question, or the 3-5 sentences surrounding the answer.
or should i do it as a separate step (go through the list of generated questions to find the closest vector)?
thank you!
ps. i tried to amend the prompt to ensure the output contains 'context', 'question', 'answer', but i am getting a non-sensical response and format.
Plain Text
question_gen_query = (
    "You are a Teacher/ Professor. Your task is to setup "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)

dataset_generator = DatasetGenerator.from_documents(
    documents[:50],
    question_gen_query=question_gen_query,
    service_context=gpt_35_context,
) 
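As a separate step, one cheap way to attach a 'context' field is to match each generated question back to the chunk it overlaps with most; this toy sketch uses word overlap in place of a real vector search (data and names are made up):

```python
def best_context(question, chunks):
    # pick the chunk sharing the most words with the question;
    # a cheap stand-in for an embedding search over the chunks
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

chunks = [
    "The statute of frauds requires certain contracts to be in writing.",
    "Stare decisis binds courts to follow precedent.",
]
question = "What does the statute of frauds require?"
record = {"context": best_context(question, chunks), "question": question}
```

Swapping the overlap score for a real retriever query would give the same record shape with embedding-based matching.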
49 comments
I am trying to load PDF tables and am using PDFTableReader() per llama-hub, but am getting this error. Should I replace PDFTableReader with PyPDF? Will it read the tables properly?
Plain Text
1973 def __init__(self, *args: Any, **kwargs: Any) -> None:
-> 1974     deprecation_with_replacement("PdfFileReader", "PdfReader", "3.0.0")
   1975     if "strict" not in kwargs and len(args) < 2:
   1976         kwargs["strict"] = True  # maintain the default

File ~/anaconda3/lib/python3.10/site-packages/PyPDF2/_utils.py:369, in deprecation_with_replacement(old_name, new_name, removed_in)
    363 def deprecation_with_replacement(
    364     old_name: str, new_name: str, removed_in: str = "3.0.0"
    365 ) -> None:
    366     """
    367     Raise an exception that a feature was already removed, but has a replacement.
    368     """
--> 369     deprecation(DEPR_MSG_HAPPENED.format(old_name, removed_in, new_name))

File ~/anaconda3/lib/python3.10/site-packages/PyPDF2/_utils.py:351, in deprecation(msg)
    350 def deprecation(msg: str) -> None:
--> 351     raise DeprecationError(msg)

DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
7 comments
hi all, would appreciate some direction on whether llama-index can help with my scenario:
say i have two manuals: how to fix a bike and how to fix a car. both are very long documents.
i ask a question about fixing a car.
i need an agent to figure out which manual to consult -> consult the appropriate manual -> give me the right answer.
can you pls recommend what i could employ for this scenario?
thx much!
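llama-index has router abstractions for exactly this pattern (e.g. a router query engine choosing between per-manual indexes, with an LLM or embeddings doing the selection). As a bare-bones illustration of the routing idea only, a keyword-count sketch (all names are made up):

```python
# hypothetical manuals keyed by their topic word
manuals = {
    "bike": "how to fix a bike ...",
    "car": "how to fix a car ...",
}

def route(question, manuals):
    # score each manual by how often its topic word appears
    # in the question, then pick the highest-scoring one
    q = question.lower()
    return max(manuals, key=lambda name: q.count(name))

choice = route("How do I replace the brake pads on my car?", manuals)
```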
12 comments