
Metadata

EntityExtractor for some reason does not extract any entities for me. I have installed span-marker, nltk, and punkt. Metadata stays empty. What can I check? I'm using it with a long .txt file with content.
Hmmm that's weird 🤔 can you share the code you have?
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    EntityExtractor,
    MetadataFeatureExtractor,
)
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)

metadata_extractor = MetadataExtractor(
    extractors=[
        EntityExtractor(prediction_threshold=0.5),  # requires span-marker and nltk to be installed
    ],
)

from llama_index import SimpleDirectoryReader

node_parser = SimpleNodeParser(
    text_splitter=text_splitter,
    metadata_extractor=metadata_extractor,
)

docs = SimpleDirectoryReader(input_files=["./assets/Interesting_Podcast.txt"]).load_data()

import nltk
nltk.download('punkt')

nodes = node_parser.get_nodes_from_documents(docs)

nodes[1].metadata
And I also have a weird error with a very basic HuggingFace embeddings example. Here is the code:

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

text = "Wonderful day"
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings("sentence-transformers/all-mpnet-base-v2")
)

embedding = embed_model.get_text_embedding(text)

print(embedding)

I get an error on this line: HuggingFaceEmbeddings("sentence-transformers/all-mpnet-base-v2")... __init__() takes 1 positional argument but 2 were given. Am I missing something? I just upgraded llama-index.
For the second error, I think it should be HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
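Something like this should work (untested, it's just your snippet with the keyword argument swapped in):

Python
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

# pass the model id as a keyword argument, not positionally
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

embedding = embed_model.get_text_embedding("Wonderful day")
print(embedding)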
For the entities, I'm not sure, the code looks correct 🤔 Are you also able to share the file? I can try running on my end
Sure, here is the file. Thank you!!
I found the problem with embeddings. Your documentation has a little error. Here: https://gpt-index.readthedocs.io/en/stable/examples/embeddings/Langchain.html It should be HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") instead of: HuggingFaceEmbeddings("sentence-transformers/all-mpnet-base-v2")
good catch, I thought we caught all of those examples 😆
I just copy-pasted your code and ran on the provided file. It seems to work for me? Here is the metadata for the first node for example

Plain Text
{'entities': {'Gina Poe', 'University of California , Los Angeles', 'Huberman Lab', 'Stanford', 'Andrew Huberman'}}
Maybe ensure you have the latest llama-index version and span-marker version?
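For example, something like pip install -U llama-index span-marker should pull the latest of both (assuming you installed them from PyPI).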
Yes, I upgraded today
Without CUDA though, it does take some time to run
yes, it ran for some minutes
I'll try again tomorrow and see if I can find the problem
Yea very weird 🤔
thank you for your help!
May I ask just one little question? I'm playing with Chroma. Is this a good way to add embeddings to a Chroma collection? I've never used Chroma before 😊 :

collection.add(
    embeddings=[
        embed_model.get_text_embedding("No one is born with an instruction manual"),
        embed_model.get_text_embedding("You can’t read someone else’s mind"),
        embed_model.get_text_embedding("Say what you mean and mean what you say"),
    ],
    documents=[
        "No one is born with an instruction manual",
        "You can’t read someone else’s mind",
        "Say what you mean and mean what you say",
    ],
    metadatas=[{"source": "life experience"}, {"source": "life experience"}, {"source": "life experience"}],
    ids=["id1", "id2", "id3"],
)
If you are intending to use llama-index with chroma, it's best to let llama-index handle the insertion

https://gpt-index.readthedocs.io/en/stable/examples/vector_stores/ChromaIndexDemo.html#chroma
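If it helps, here is a rough sketch of what that looks like, based on that demo and reusing the docs and embed_model from your snippets above (the collection name is just a placeholder, and I haven't run this end-to-end):

Python
import chromadb
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

# create a chroma collection and wrap it in a llama-index vector store
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# llama-index chunks, embeds, and inserts the documents into chroma for you
service_context = ServiceContext.from_defaults(embed_model=embed_model)
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)

# queries now retrieve from the chroma collection
query_engine = index.as_query_engine()
print(query_engine.query("What is this podcast about?"))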
great, thank you