Find answers from the community

MitchMcD
Joined September 25, 2024
I keep getting an error when trying to use the new, cheaper gpt-4o-2024-08-06 model for generating Q&A; i did run the upgrade.
https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/
thank you
2 comments
I am evaluating various embedding models. What's the best way to modify this guide and change embedding models only while keeping all other variables constant? https://docs.llamaindex.ai/en/stable/examples/evaluation/retrieval/retriever_eval/
thank you
19 comments
can I extract page numbers and other metadata with LlamaParse?
6 comments
what is the privacy policy of LlamaParse? where is the data stored? who gets access? etc. thx
Helping some clients parse complicated PDF documents.
18 comments
Plain Text
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

llm = OpenAI(model="gpt-4-0125-preview", temperature=0.1)

Settings.callback_manager = callback_manager


Plain Text
ImportError: cannot import name 'OpenAIFineTuningHandler' from 'llama_index.core.callbacks' (/usr/local/lib/python3.10/dist-packages/llama_index/core/callbacks/__init__.py)
5 comments
i've been trying to read PDFs, DOCX files, and directories, but am still getting this error:
Plain Text
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("/Users/home/Library/Mobile Documents/com~apple~CloudDocs/Academia/Legal Research").load_data()
index = VectorStoreIndex.from_documents(documents)

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
/Users/home/Downloads/OpenAI_Finetuning_Distill_GPT_4_to_GPT_3_5_(v2).ipynb Cell 13 line 1
----> 1 from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
      3 documents = SimpleDirectoryReader("/Users/home/Library/Mobile Documents/com~apple~CloudDocs/Academia/Legal Research").load_data()
      4 index = VectorStoreIndex.from_documents(documents)

File ~/.ooba/text-generation-ui/installer_files/env/lib/python3.10/site-packages/llama_index/__init__.py:21
     17 from llama_index.embeddings import OpenAIEmbedding
     19 # indices
     20 # loading
---> 21 from llama_index.indices import (
     22     ComposableGraph,
     23     DocumentSummaryIndex,
     24     GPTDocumentSummaryIndex,
     25     GPTKeywordTableIndex,
     26     GPTKnowledgeGraphIndex,
     27     GPTListIndex,
     28     GPTRAKEKeywordTableIndex,
     29     GPTSimpleKeywordTableIndex,
     30     GPTTreeIndex,
     31     GPTVectorStoreIndex,
     32     KeywordTableIndex,
     33     KnowledgeGraphIndex,
     34     ListIndex,
...
File ~/.ooba/text-generation-ui/installer_files/env/lib/python3.10/site-packages/pydantic/main.py:341, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for DataSource

i've installed all the required packages, i think, with
Plain Text
pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
19 comments
given the changes to ServiceContext, can i still use this notebook for fine-tuning?
https://colab.research.google.com/drive/1NgyCJVyrC2xcZ5lxt2frTU862v6eJHlc?usp=sharing#scrollTo=J9vTWspqPwYY
4 comments
question for this esteemed community: when we split data from a CSV into chunks, we embed the chunks and do a vector search, but we can't see the full original text prior to splitting (unless you link it to a database that stores the full original texts).
what if we embed the metadata too? presumably, the metadata for chunks coming from the same source text should be nearly identical, and the relevance score between those metadata vectors will be almost 1. if you need to find all chunks from the same original text, and thus show the full text rather than excerpts, you just run another vector search against this one chunk. what do you think?
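A minimal sketch of that idea, with made-up toy vectors standing in for real metadata embeddings (cosine similarity is what most vector stores score with):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def chunks_from_same_source(query_vec, metadata_vecs, threshold=0.95):
    # indices of chunks whose metadata vector is nearly identical
    # to the query chunk's metadata vector
    return [i for i, v in enumerate(metadata_vecs)
            if cosine(query_vec, v) >= threshold]

# toy metadata embeddings: chunks 0 and 1 come from the same text,
# chunk 2 from a different one
vecs = [[1.0, 0.0, 0.1], [0.99, 0.01, 0.12], [0.0, 1.0, 0.0]]
same = chunks_from_same_source(vecs[0], vecs)
```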
7 comments
Query

I have a table hosted on Supabase with id, content, metadata, and embeddings columns; i did not use LlamaIndex or LangChain to create it.
is it possible to run an index over it and start querying using llama-index? thanks!
4 comments
hello, for this tutorial, how do i call my fine-tuned model, not the base one? thx https://gpt-index.readthedocs.io/en/v0.8.45/examples/llm/gradient_base_model.html
3 comments
hey llama team. i think i saw somewhere that you open-sourced this interface, but i can't find it. can you pls help me locate it? thanks much!
8 comments
This is probably such a simple question and the answer is probably written somewhere on the Docs page, but I could not find it. How do I preserve the PDF page number for a long PDF, so that when getting vector search (or any other) results, it shows an excerpt + a page number? Thank you
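For what it's worth, llama-index's stock PDF reader records a page label in each document's metadata, I believe; as a library-agnostic sketch of the underlying idea (all names here are made up):

```python
def chunk_pages(pages, chunk_size=200):
    # split per-page text into chunks, tagging each chunk with
    # the page it came from (pages is a list of page strings, page 1 first)
    chunks = []
    for page_num, text in enumerate(pages, start=1):
        for start in range(0, len(text), chunk_size):
            chunks.append({
                "text": text[start:start + chunk_size],
                "metadata": {"page_number": page_num},
            })
    return chunks

docs = chunk_pages(["first page text", "second page text"], chunk_size=10)
```

Because the page number rides along in each chunk's metadata, any retrieved excerpt can display it next to the text.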
5 comments
hi all, if i am embedding a book, is there a way to make each chapter a node, so that when i ask "summarize chapter x" it works through this node only? would appreciate guidance. thank you!
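A hedged sketch of one way to do the chapter-per-node split before indexing, assuming chapters start with "Chapter N" headings (the regex and names are illustrative; with chapter stored as node metadata, a query can be filtered to one chapter):

```python
import re

def split_into_chapters(book_text):
    # split on 'Chapter N' headings at the start of a line;
    # re.split keeps the captured headings at odd indices
    parts = re.split(r"(?m)^(Chapter \d+)", book_text)
    nodes = []
    for i in range(1, len(parts), 2):
        nodes.append({
            "text": parts[i + 1].strip(),
            "metadata": {"chapter": parts[i]},
        })
    return nodes

book = "Chapter 1\nBikes are fun.\nChapter 2\nCars are fast.\n"
nodes = split_into_chapters(book)
```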
11 comments
do i get it right that we can now upload whole books with LLaVA without worrying about chunking? storage-wise, are images converted into vectors of similar dimensions, or will they require more space?
4 comments
why doesn't the pandas index read the headers of the df first? e.g. one of my columns is 'Ship State', but llama keeps using just 'State' in its queries
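One workaround people use is to snap the model's guessed column name to the closest real header before running the generated pandas code; a stdlib sketch (the column list is hypothetical):

```python
import difflib

# hypothetical headers; the model guessed 'State' instead of 'Ship State'
columns = ["Order ID", "Ship State", "Ship City", "Amount"]

def nearest_column(guess, columns):
    # map a guessed column name to the closest real header
    matches = difflib.get_close_matches(guess, columns, n=1, cutoff=0.5)
    return matches[0] if matches else None

fixed = nearest_column("State", columns)
```

Another common fix is simply to list the exact `df.columns` in the prompt so the model never has to guess.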
24 comments
I have just completed the fine-tuning of 3.5-turbo ("Fine-tuning to Memorize Knowledge"). Here are the results if anyone is weighing fine-tuning vs RAGing:
  • first off, shout-out to the Llama-Index team for putting in place this well-thought-out and very comprehensive guide and eval framework, and for helping me resolve some of the issues. There are some minor hiccups in the code, but you can resolve those easily if they get flagged on your machine;
  • I fine-tuned on a legal textbook, "Legal Research, Analysis & Writing". It is a very foundational treatise if you want to become a lawyer;
  • ~1800 questions/answers, split 70/30 into train/val;
  • did two iterations: first, fine-tuned using the train/val datasets, then fine-tuned the already fine-tuned model again using the complete dataset for training only;
  • you will see in the code that the ground truth (gt) comes from gpt-4, while the base is gpt-3.5. while i am a bit hesitant to compare apples to oranges, for the main question i was after (is the RAG framework a solid approach), it worked;
  • objective results:
'ft_rag_score': 0.775,
'ft_score': 0.725,
'rag_score': 0.825,
'base_score': 0.675
as you can see, 'rag only' wins;
  • the temperature was 0 for all models;
  • in terms of legal style, vocabulary, and coherence, gpt-3.5-turbo is already quite good on the 'how' part (it can explain how to write a legal memo, how to research a case, etc.), but it still hallucinates a lot on the 'what' part (what is case x about, when was statute y enacted, etc.);
  • i was surprised to see that ft_rag is a little worse than rag only, but it is good to know that grounding models in existing knowledge works great 🙂;
  • happy to help or answer any questions. thanks again, LlamaIndex team, for doing what you are doing.
1 comment
Is it possible to extract information from the 'Pandas Instructions' to show a corresponding graph in plt?
2 comments
can i manually correct the pandas formula for the pandas index? it keeps giving me the wrong response. e.g. i ask for the sales on the busiest day, and it keeps giving me the sales for the whole year
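The intended computation, written out directly; a stdlib sketch with made-up data (in pandas it would be roughly `df.groupby('Date')['Sales'].sum()` followed by `.idxmax()`, which can be run by hand to check the generated formula):

```python
from collections import defaultdict

# hypothetical per-transaction rows: (date, amount)
sales = [("2024-03-01", 100), ("2024-03-01", 250),
         ("2024-03-02", 80), ("2024-03-03", 300)]

daily = defaultdict(int)
for day, amount in sales:
    daily[day] += amount          # total sales per day

busiest_day = max(daily, key=daily.get)
busiest_total = daily[busiest_day]
```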
3 comments
Plain Text
index.query("What is the document about?")

Plain Text
'VectorStoreIndex' object has no attribute 'query'

@Logan M what's wrong with the query command? thanks!
2 comments
should i use another encoding?

Plain Text
from llama_index.readers.file import FlatReader
from pathlib import Path

reader = FlatReader()
docs_2021 = reader.load_data(Path("my_file.pdf"))


Plain Text
File ~/.local/lib/python3.10/site-packages/llama_index/readers/file/flat/base.py:28, in FlatReader.load_data(self, file, extra_info)
     26 """Parse file into string."""
     27 with open(file, encoding="utf-8") as f:
---> 28     content = f.read()
     29 metadata = {"filename": file.name, "extension": file.suffix}
     30 if extra_info:

File /usr/lib/python3.10/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
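FlatReader opens the file as UTF-8 text, so a binary PDF (or any non-UTF-8 file) will raise exactly this error; a PDF really needs a proper PDF parser. For plain-text files of unknown encoding, a fallback loop like this sketch can help (the helper name is made up):

```python
import os
import tempfile

def read_text_any_encoding(path, encodings=("utf-8", "latin-1")):
    # try each encoding in turn; as a last resort, replace bad bytes
    raw = open(path, "rb").read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

# demo: a latin-1 file that a strict utf-8 read would crash on
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".txt")
tmp.write("café".encode("latin-1"))
tmp.close()
text = read_text_any_encoding(tmp.name)
os.unlink(tmp.name)
```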
1 comment
This is probably a very stupid question, my apologies in advance.

If I follow any of these two guides:
https://docs.llamaindex.ai/en/stable/examples/usecases/10q_sub_question.html
https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb
And I use, say, Anthropic for the LLM and OpenAI for embeddings. Is my data accessible only to OpenAI and Anthropic, or does it get shared with Llama-Index too?
13 comments
i am going through this tutorial (thank you!). i loaded a CSV file, but when trying to create the index with
Plain Text
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

i keep getting this error: AttributeError: 'Document' object has no attribute 'get_doc_id'
any advice would be much appreciated! thank you!
https://docs.llamaindex.ai/en/stable/examples/vector_stores/Timescalevector.html#load-documents-and-metadata-into-timescalevector-vectorstore
1 comment
i am trying to generate a dataset for fine-tuning on HF. is it possible to also extract the 'context' when running this code? which i presume would be the top chunk from a vector search if i run a question, or the 3-5 sentences surrounding the answer.
or should i do it as a separate step (go through the list of generated questions to find the closest vector)?
thank you!
ps. i tried to amend the prompt to ensure the output contains 'context', 'question', 'answer', but i am getting a non-sensical response and format.
Plain Text
question_gen_query = (
    "You are a Teacher/ Professor. Your task is to setup "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)

dataset_generator = DatasetGenerator.from_documents(
    documents[:50],
    question_gen_query=question_gen_query,
    service_context=gpt_35_context,
) 
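As a separate step, one cheap way to attach a 'context' field is to match each generated question back to the chunk it overlaps with most; this toy sketch uses word overlap in place of a real vector search (data and names are made up):

```python
def best_context(question, chunks):
    # pick the chunk sharing the most words with the question;
    # a cheap stand-in for an embedding search over the chunks
    q_words = set(question.lower().split())
    return max(chunks, key=lambda c: len(q_words & set(c.lower().split())))

chunks = [
    "The statute of frauds requires certain contracts to be in writing.",
    "Stare decisis binds courts to follow precedent.",
]
question = "What does the statute of frauds require?"
record = {"context": best_context(question, chunks), "question": question}
```

Swapping the overlap score for a real retriever query would give the same record shape with embedding-based matching.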
49 comments
I am trying to load PDF tables and am using PDFTableReader() per llama-hub, but am getting this error. Should I replace PDFTableReader with PyPDF? Will it read the tables properly?
Plain Text
1973 def __init__(self, *args: Any, **kwargs: Any) -> None:
-> 1974     deprecation_with_replacement("PdfFileReader", "PdfReader", "3.0.0")
   1975     if "strict" not in kwargs and len(args) < 2:
   1976         kwargs["strict"] = True  # maintain the default

File ~/anaconda3/lib/python3.10/site-packages/PyPDF2/_utils.py:369, in deprecation_with_replacement(old_name, new_name, removed_in)
    363 def deprecation_with_replacement(
    364     old_name: str, new_name: str, removed_in: str = "3.0.0"
    365 ) -> None:
    366     """
    367     Raise an exception that a feature was already removed, but has a replacement.
    368     """
--> 369     deprecation(DEPR_MSG_HAPPENED.format(old_name, removed_in, new_name))

File ~/anaconda3/lib/python3.10/site-packages/PyPDF2/_utils.py:351, in deprecation(msg)
    350 def deprecation(msg: str) -> None:
--> 351     raise DeprecationError(msg)

DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
7 comments
hi all, would appreciate some direction on whether llama-index can help with my scenario:
say i have two manuals: how to fix a bike and how to fix a car. both are very long documents.
i ask a question about fixing a car.
i need an agent to figure out which manual to consult -> consult the appropriate manual -> give me the right answer.
can you pls recommend what i could employ for this scenario?
thx much!
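llama-index has router abstractions for exactly this pattern (e.g. a router query engine choosing between per-manual indexes, with an LLM or embeddings doing the selection). As a bare-bones illustration of the routing idea only, a keyword-count sketch (all names are made up):

```python
# hypothetical manuals keyed by their topic word
manuals = {
    "bike": "how to fix a bike ...",
    "car": "how to fix a car ...",
}

def route(question, manuals):
    # score each manual by how often its topic word appears
    # in the question, then pick the highest-scoring one
    q = question.lower()
    return max(manuals, key=lambda name: q.count(name))

choice = route("How do I replace the brake pads on my car?", manuals)
```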
12 comments