Gianluca
Joined September 25, 2024
Hi, do you think it is a good approach to create an ingestion pipeline with documents from SimpleDirectoryReader and nodes from HTML files parsed with HTMLNodeParser?
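The underlying idea, combining plain documents with text nodes parsed out of HTML into one pool of chunks before indexing, can be sketched with the standard library alone. The class and function names below are illustrative, not LlamaIndex APIs.

```python
from html.parser import HTMLParser

# Collect non-empty text fragments from HTML, roughly what an
# HTML node parser would hand back as node text.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def build_corpus(plain_docs, html_sources):
    # Pool directory-loaded documents and HTML-derived chunks into
    # one list; an index would then be built over this combined list.
    chunks = list(plain_docs)
    for html in html_sources:
        parser = TextExtractor()
        parser.feed(html)
        chunks.extend(parser.chunks)
    return chunks

corpus = build_corpus(
    ["Notes loaded from a directory."],
    ["<html><body><p>Parsed HTML paragraph.</p></body></html>"],
)
print(corpus)  # both sources end up as peers in one corpus
```

Since both loaders ultimately yield text chunks, feeding them into the same index is a reasonable approach; the main thing to watch is keeping metadata (source file, HTML tag) consistent across the two paths.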
1 comment
Hi guys, I am having problems running a HuggingFaceLLM with a local model; I want to run the index completely offline, but I can't instantiate the tokenizer from the .bin file I downloaded from the internet.
This is my code:
Python
from transformers import AutoTokenizer

from llama_index.llms import HuggingFaceLLM

model_path = "./models/llama-2-13b.ggmlv3.q4_0.bin"

tokenizer = AutoTokenizer.from_pretrained(model_path)  # line 30 in the traceback below

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    model=model_path,
    tokenizer=tokenizer,
    device_map="cpu",
)

Plain Text
Traceback (most recent call last):
  File "/home/{path}/example1.py", line 30, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_path)
  File "/home/{path}/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 652, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/{path}/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 496, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/{path}/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/home/{path}/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/{path}/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/llama-2-13b.ggmlv3.q4_0.bin'. Use `repo_type` argument if needed.

Sorry if this is a duplicate; I didn't find anything using the Discord search.
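The validator that raises here rejects anything not shaped like `repo_name` or `namespace/repo_name`, so a local file path like `./models/llama-2-13b.ggmlv3.q4_0.bin` fails before any download is attempted. A rough approximation of that shape check (a hypothetical regex, not huggingface_hub's actual code):

```python
import re

# Hypothetical approximation of the repo-id shape that
# huggingface_hub's validate_repo_id expects: "repo_name" or
# "namespace/repo_name", which a "./..." file path never matches.
REPO_ID_RE = re.compile(r"^[A-Za-z0-9][\w.-]*(/[A-Za-z0-9][\w.-]*)?$")

def looks_like_repo_id(value: str) -> bool:
    return REPO_ID_RE.fullmatch(value) is not None

print(looks_like_repo_id("meta-llama/Llama-2-13b"))                # True
print(looks_like_repo_id("./models/llama-2-13b.ggmlv3.q4_0.bin"))  # False
```

Separately from the validation error, a single GGML `.bin` file contains only quantized weights and no tokenizer files, so AutoTokenizer has nothing to load from it even with a valid path; GGML models are typically run through llama.cpp bindings rather than transformers.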
8 comments
Hi, is there some concept in LlamaIndex that passes the input through the LLM before querying the index?
For example, "find all the IPs in the document": IPs have a standard format like x.x.x.x, so maybe preprocessing the input to add IP information would help get better info from the documents?
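The preprocessing step described above can be sketched with a regex that pulls IPv4-shaped strings out of the text and appends them to the query before it reaches the index. The helper names are assumptions for illustration, not a LlamaIndex API:

```python
import re

# Matches IPv4-shaped strings like 10.0.0.1 (no range validation).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def enrich_query(query: str, document_text: str) -> str:
    """Append any IPs found in the document as extra context for the query."""
    ips = IPV4_RE.findall(document_text)
    if not ips:
        return query
    return f"{query}\nKnown IP addresses in the document: {', '.join(ips)}"

doc = "Server 10.0.0.1 forwards to 192.168.1.20."
print(enrich_query("Find all the IPs in the document", doc))
```

The actual query-engine call is out of scope here; this only shows the input-transformation idea, which in LlamaIndex would be plugged in as a query transform before retrieval.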
5 comments
I have exported a .docx from Google Docs and I get this type of newlines; is this text splitter a good solution?
Python
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    # Runs of 1..10 newlines; same list as writing each separator out by hand.
    backup_separators=["\n" * n for n in range(1, 11)],
)
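An alternative to enumerating every run of newlines as a backup separator is to normalize the exported text first, collapsing any run of blank lines into a single paragraph break before the splitter sees it. A minimal sketch (the function name is an assumption for illustration):

```python
import re

def normalize_newlines(text: str) -> str:
    """Collapse runs of 2+ newlines into a single paragraph break."""
    return re.sub(r"\n{2,}", "\n\n", text)

exported = "Title\n\n\n\n\nFirst paragraph.\n\n\nSecond paragraph."
print(normalize_newlines(exported))
```

After normalization, a single `"\n\n"` backup separator covers every case the long list was written for.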
3 comments