
Metadata

EntityExtractor for some reason does not extract any entities for me. I have installed span-marker, nltk, and punkt. Metadata stays empty. What can I check? I'm using it with a long .txt file with content.
Hmmm that's weird 🤔 can you share the code you have?
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    EntityExtractor,
    MetadataFeatureExtractor,
)
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)

metadata_extractor = MetadataExtractor(
    extractors=[
        EntityExtractor(prediction_threshold=0.5),  # requires span-marker and nltk to be installed
    ],
)

from llama_index import SimpleDirectoryReader

node_parser = SimpleNodeParser(
    text_splitter=text_splitter,
    metadata_extractor=metadata_extractor,
)

docs = SimpleDirectoryReader(input_files=["./assets/Interesting_Podcast.txt"]).load_data()

import nltk
nltk.download('punkt')

nodes = node_parser.get_nodes_from_documents(docs)

nodes[1].metadata
And I also have a weird error with a very basic HuggingFace embeddings example. Here is the code:

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

text = "Wonderful day"
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings("sentence-transformers/all-mpnet-base-v2")
)

embedding = embed_model.get_text_embedding(text)

print(embedding)

I get an error on this line: HuggingFaceEmbeddings("sentence-transformers/all-mpnet-base-v2")... __init__() takes 1 positional argument but 2 were given. Am I missing something? I just upgraded llama-index.
For the second error, I think it should be HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
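Something like this should work (untested, it's just your snippet with the keyword argument swapped in):

Python
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

# pass the model id as a keyword argument, not positionally
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

embedding = embed_model.get_text_embedding("Wonderful day")
print(embedding)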
For the entities, I'm not sure, the code looks correct 🤔 Are you also able to share the file? I can try running on my end
Sure, here is the file. Thank you!!
I found the problem with embeddings. Your documentation has a little error. Here: https://gpt-index.readthedocs.io/en/stable/examples/embeddings/Langchain.html It should be HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") instead of: HuggingFaceEmbeddings("sentence-transformers/all-mpnet-base-v2")
good catch, I thought we caught all of those examples 😆
I just copy-pasted your code and ran on the provided file. It seems to work for me? Here is the metadata for the first node for example

Plain Text
{'entities': {'Gina Poe', 'University of California , Los Angeles', 'Huberman Lab', 'Stanford', 'Andrew Huberman'}}
Maybe ensure you have the latest llama-index version and span-marker version?
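For example, something like pip install -U llama-index span-marker should pull the latest of both (assuming you installed them from PyPI).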
Yes, I upgraded today
Without CUDA though, it does take some time to run
yes, it ran for some minutes
I'll try again tomorrow and see if I can find the problem
Yea very weird 🤔
thank you for your help!
May I ask just one little question? I'm playing with Chroma. Is this a good way to add embeddings to a Chroma collection? I've never used Chroma before 😊 :

collection.add(
    embeddings=[
        embed_model.get_text_embedding("No one is born with an instruction manual"),
        embed_model.get_text_embedding("You can’t read someone else’s mind"),
        embed_model.get_text_embedding("Say what you mean and mean what you say"),
    ],
    documents=[
        "No one is born with an instruction manual",
        "You can’t read someone else’s mind",
        "Say what you mean and mean what you say",
    ],
    metadatas=[{"source": "life experience"}, {"source": "life experience"}, {"source": "life experience"}],
    ids=["id1", "id2", "id3"],
)
If you are intending to use llama-index with chroma, it's best to let llama-index handle the insertion

https://gpt-index.readthedocs.io/en/stable/examples/vector_stores/ChromaIndexDemo.html#chroma
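If it helps, here is a rough sketch of what that looks like, based on that demo and reusing the docs and embed_model from your snippets above (the collection name is just a placeholder, and I haven't run this end-to-end):

Python
import chromadb
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

# create a chroma collection and wrap it in a llama-index vector store
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# llama-index chunks, embeds, and inserts the documents into chroma for you
service_context = ServiceContext.from_defaults(embed_model=embed_model)
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)

# queries now retrieve from the chroma collection
query_engine = index.as_query_engine()
print(query_engine.query("What is this podcast about?"))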
great, thank you