Gianluca
Joined September 25, 2024
Hi, do you think it is a good approach to create an ingestion pipeline with documents from SimpleDirectoryReader and nodes from HTML files parsed with HTMLNodeParser?
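The underlying idea, combining plain documents with text nodes parsed out of HTML into one pool of chunks before indexing, can be sketched with the standard library alone. The class and function names below are illustrative, not LlamaIndex APIs.

```python
from html.parser import HTMLParser

# Collect non-empty text fragments from HTML, roughly what an
# HTML node parser would hand back as node text.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def build_corpus(plain_docs, html_sources):
    # Pool directory-loaded documents and HTML-derived chunks into
    # one list; an index would then be built over this combined list.
    chunks = list(plain_docs)
    for html in html_sources:
        parser = TextExtractor()
        parser.feed(html)
        chunks.extend(parser.chunks)
    return chunks

corpus = build_corpus(
    ["Notes loaded from a directory."],
    ["<html><body><p>Parsed HTML paragraph.</p></body></html>"],
)
print(corpus)  # both sources end up as peers in one corpus
```

Since both loaders ultimately yield text chunks, feeding them into the same index is a reasonable approach; the main thing to watch is keeping metadata (source file, HTML tag) consistent across the two paths.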
1 comment
Hi guys, I am having problems running a HuggingFaceLLM with a local model; I want to run the index completely offline, but I can't instantiate the tokenizer from the .bin file I downloaded from the internet.
This is my code:
Python
from transformers import AutoTokenizer

from llama_index.llms import HuggingFaceLLM

model_path = "./models/llama-2-13b.ggmlv3.q4_0.bin"

tokenizer = AutoTokenizer.from_pretrained(model_path)  # line 30 in the traceback below

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    model=model_path,
    tokenizer=tokenizer,
    device_map="cpu",
)

Plain Text
Traceback (most recent call last):
  File "/home/{path}/example1.py", line 30, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_path)
  File "/home/{path}/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 652, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/{path}/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 496, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/{path}/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/home/{path}/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/{path}/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './models/llama-2-13b.ggmlv3.q4_0.bin'. Use `repo_type` argument if needed.

Sorry if this is a duplicate; I didn't find anything using the Discord search.
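The validator that raises here rejects anything not shaped like `repo_name` or `namespace/repo_name`, so a local file path like `./models/llama-2-13b.ggmlv3.q4_0.bin` fails before any download is attempted. A rough approximation of that shape check (a hypothetical regex, not huggingface_hub's actual code):

```python
import re

# Hypothetical approximation of the repo-id shape that
# huggingface_hub's validate_repo_id expects: "repo_name" or
# "namespace/repo_name", which a "./..." file path never matches.
REPO_ID_RE = re.compile(r"^[A-Za-z0-9][\w.-]*(/[A-Za-z0-9][\w.-]*)?$")

def looks_like_repo_id(value: str) -> bool:
    return REPO_ID_RE.fullmatch(value) is not None

print(looks_like_repo_id("meta-llama/Llama-2-13b"))                # True
print(looks_like_repo_id("./models/llama-2-13b.ggmlv3.q4_0.bin"))  # False
```

Separately from the validation error, a single GGML `.bin` file contains only quantized weights and no tokenizer files, so AutoTokenizer has nothing to load from it even with a valid path; GGML models are typically run through llama.cpp bindings rather than transformers.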
8 comments
Hi, is there some concept in LlamaIndex that passes the input through the LLM before querying the index?
For example, "find all the IPs in the document": IPs have a standard format like x.x.x.x, so maybe preprocessing the input to add IP information would help get better info from the documents?
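The preprocessing step described above can be sketched with a regex that pulls IPv4-shaped strings out of the text and appends them to the query before it reaches the index. The helper names are assumptions for illustration, not a LlamaIndex API:

```python
import re

# Matches IPv4-shaped strings like 10.0.0.1 (no range validation).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def enrich_query(query: str, document_text: str) -> str:
    """Append any IPs found in the document as extra context for the query."""
    ips = IPV4_RE.findall(document_text)
    if not ips:
        return query
    return f"{query}\nKnown IP addresses in the document: {', '.join(ips)}"

doc = "Server 10.0.0.1 forwards to 192.168.1.20."
print(enrich_query("Find all the IPs in the document", doc))
```

The actual query-engine call is out of scope here; this only shows the input-transformation idea, which in LlamaIndex would be plugged in as a query transform before retrieval.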
5 comments
I have exported a .docx from Google Docs and I get this type of newlines; is this text splitter a good solution?
Python
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    # Runs of 1..10 newlines; same list as writing each separator out by hand.
    backup_separators=["\n" * n for n in range(1, 11)],
)
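An alternative to enumerating every run of newlines as a backup separator is to normalize the exported text first, collapsing any run of blank lines into a single paragraph break before the splitter sees it. A minimal sketch (the function name is an assumption for illustration):

```python
import re

def normalize_newlines(text: str) -> str:
    """Collapse runs of 2+ newlines into a single paragraph break."""
    return re.sub(r"\n{2,}", "\n\n", text)

exported = "Title\n\n\n\n\nFirst paragraph.\n\n\nSecond paragraph."
print(normalize_newlines(exported))
```

After normalization, a single `"\n\n"` backup separator covers every case the long list was written for.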
3 comments