At a glance

The community members are discussing an issue in LlamaIndex versions 0.6.24 and 0.7.23 where VectorStoreIndex.from_documents does not insert the text metadata into Pinecone. They try various approaches, such as adding text to the metadata directly, controlling the chunk size, and using the _node_content field as the text key. They also note that the text splitting and chunk size defaults have likely changed since version 0.6.x, and they show how to configure the service context to handle chunking and metadata. The issue is resolved by using the _node_content field as the text key, though the community members are still exploring why the number of vectors differs between the older and newer versions of LlamaIndex.

Doing some spelunking: when you use a LlamaIndex Loader in versions 0.6.24 and 0.7.23, VectorStoreIndex.from_documents does not insert the text metadata into Pinecone
Those are some oooold versions.

I'm pretty sure the latest works fine, but there might have been some minor tweaks. Now we serialize the entire node object into Pinecone.
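For reference, this is roughly the shape of a vector's metadata after an insert with a recent llama-index; _node_content holds the JSON-serialized node, and the sibling field names here are from memory and may vary by version:
Plain Text
{
    "_node_content": "{\"id_\": \"...\", \"text\": \"...\", \"metadata\": {...}, ...}",
    "_node_type": "TextNode",
    "doc_id": "...",
}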
It doesn't. What had me notice the issue was upgrading.
Plain Text
from llama_index import SimpleDirectoryReader, VectorStoreIndex, download_loader

def import_file(dir: str):
    # UnstructuredReader handles both .txt and .pdf files
    UnstructuredReader = download_loader("UnstructuredReader")
    dir_reader = SimpleDirectoryReader(dir, file_extractor={
        ".txt": UnstructuredReader(),
        ".pdf": UnstructuredReader(),
    })
    documents = dir_reader.load_data()

    # Embed and upsert into Pinecone via project helpers (not shown here)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=get_pinecone_storage_context(),
        service_context=get_service_context(),
    )

    return index
Simple code for importing a file gives a Pinecone vector whose metadata is missing the text field.
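The get_pinecone_storage_context() helper is never shown in the thread; a minimal sketch of what such a helper might look like with llama-index 0.9.x and pinecone-client 2.x (the API key, environment, and index name are placeholders):
Plain Text
import pinecone
from llama_index import StorageContext
from llama_index.vector_stores import PineconeVectorStore

def get_pinecone_storage_context():
    pinecone.init(api_key="...", environment="us-west1-gcp")
    # Wrap an existing Pinecone index in a llama-index vector store
    vector_store = PineconeVectorStore(pinecone_index=pinecone.Index("ranch-bot"))
    return StorageContext.from_defaults(vector_store=vector_store)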
Plain Text
βœ— pip freeze
aiohttp==3.9.1
aiosignal==1.3.1
aiostream==0.5.2
anyio==3.7.1
appnope==0.1.3
argcomplete==3.2.1
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.1.0
backoff==2.2.1
beautifulsoup4==4.12.2
certifi==2023.11.17
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
comm==0.2.0
dataclasses-json==0.6.3
debugpy==1.8.0
decorator==5.1.1
Deprecated==1.2.14
distro==1.8.0
dnspython==2.4.2
docx2txt==0.8
EbookLib==0.18
emoji==2.9.0
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.98.0
fastapi-sessions==0.3.2
filelock==3.13.1
filetype==1.2.0
frozenlist==1.4.1
fsspec==2023.12.2
greenlet==3.0.2
h11==0.14.0
httpcore==1.0.2
httpx==0.25.2
huggingface-hub==0.19.4
idna==3.6
importlib-metadata==7.0.0
ipykernel==6.27.1
ipython==8.18.1
itsdangerous==2.1.2
jedi==0.19.1
joblib==1.3.2
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==2.4
jupyter_client==8.6.0
jupyter_core==5.5.1
langchain==0.0.340
langdetect==1.0.9
langsmith==0.0.71
llama-index==0.9.7
loguru==0.7.2
lxml==4.9.3
marshmallow==3.20.1
matplotlib-inline==0.1.6
multidict==6.0.4
mypy-extensions==1.0.0
nest-asyncio==1.5.8
nltk==3.8.1
numpy==1.26.2
openai==1.3.5
packaging==23.2
pandas==2.1.4
parso==0.8.3
pexpect==4.9.0
Pillow==10.1.0
pinecone-client==2.2.4
platformdirs==4.1.0
prompt-toolkit==3.0.43
psutil==5.9.7
ptyprocess==0.7.0
pure-eval==0.2.2
pydantic==1.10.13
Pygments==2.17.2
python-dateutil==2.8.2
python-dotenv==1.0.0
python-iso639==2023.12.11
python-magic==0.4.27
python-pptx==0.6.23
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.2
rapidfuzz==3.5.2
regex==2023.10.3
requests==2.31.0
safetensors==0.4.1
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
SpeechRecognition==3.10.1
SQLAlchemy==2.0.23
stack-data==0.6.3
starlette==0.27.0
tabulate==0.9.0
tenacity==8.2.3
termcolor==2.3.0
textract==1.5.0
tiktoken==0.5.2
tokenizers==0.15.0
tornado==6.4
tqdm==4.66.1
traitlets==5.14.0
transformers==4.36.2
typing-inspect==0.9.0
typing_extensions==4.9.0
tzdata==2023.3
unstructured==0.11.5
unstructured-client==0.15.0
urllib3==2.1.0
uvicorn==0.22.0
wcwidth==0.2.12
wrapt==1.16.0
xlrd==2.0.1
XlsxWriter==3.1.9
yarl==1.9.4
zipp==3.17.0
I played through each release and it stopped working at 0.6.24.
If I inspect the document object that came from the loader, the text is there, but not in the metadata.
If I add text to the metadata, it's pushed to Pinecone.
If I add too much text to the metadata, the chunking is not handled automagically.
The chunking is handled, but only up to a certain point. If you specify a chunk_size of 512, then the real chunk size is chunk_size - len(metadata), since the metadata is included when sending to the LLM/embedding model.

You can control which metadata is sent to each, though.
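For example, documents and nodes expose excluded_llm_metadata_keys and excluded_embed_metadata_keys in llama-index 0.9.x; a small sketch (the source field is purely illustrative):
Plain Text
for doc in documents:
    doc.metadata["source"] = "notes.txt"           # illustrative extra metadata
    doc.excluded_llm_metadata_keys = ["source"]    # hidden from the LLM prompt
    doc.excluded_embed_metadata_keys = ["source"]  # hidden from the embedding text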
Wait, why does the text need to be a key? It's technically in that _node_content field -- which is also used to reconstruct the node when fetching
Hmm, not sure what to tell you. I would use a llama-index retriever for retrieving from a Pinecone db built with llama-index.
Well, that fixed it.
I just used _node_content as the text key.
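Presumably that looked something like the following; text_key is a parameter on PineconeVectorStore in llama-index 0.9.x, but the surrounding wiring and index name here are guesses:
Plain Text
import pinecone
from llama_index.vector_stores import PineconeVectorStore

pinecone.init(api_key="...", environment="us-west1-gcp")  # placeholders

# Read the node text out of the serialized _node_content field instead
# of the default "text" metadata key (the fix described above)
vector_store = PineconeVectorStore(
    pinecone_index=pinecone.Index("ranch-bot"),  # hypothetical index name
    text_key="_node_content",
)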
What's odd is that the number of vectors is very different when I did the import in the newer version of LlamaIndex.
The text splitting and chunk size have likely changed since 0.6.x

Default is SentenceSplitter (which splits by respecting sentence boundaries) with chunk_size=1024 and chunk_overlap=20
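Written out explicitly with the defaults quoted above (import path per llama-index 0.9.x):
Plain Text
from llama_index.text_splitter import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)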
How do you configure that with the loader?
How are you loading/inserting data right now?
It'll be part of the service context
Plain Text
service_context = ServiceContext.from_defaults(..., chunk_size=512)


Or

Plain Text
from llama_index.text_splitter import TokenTextSplitter
service_context = ServiceContext.from_defaults(..., text_splitter=TokenTextSplitter(chunk_size=512))
Plain Text
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext

def get_llm_predictor():
    # CHAT_MODEL is a project-level constant defined elsewhere, e.g. "gpt-3.5-turbo"
    return LLMPredictor(llm=ChatOpenAI(temperature=0, max_tokens=512, model_name=CHAT_MODEL))

def get_service_context():
    llm_predictor_chatgpt = get_llm_predictor()
    return ServiceContext.from_defaults(llm_predictor=llm_predictor_chatgpt)
Would you add that to the ServiceContext factory method?
Does it need to match the max_tokens of the LLM?
No, max_tokens is unrelated to the chunk size; max_tokens caps the LLM's response length, while chunk_size controls how documents are split for embedding.
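Applied to the factory method above, the chunk size setting would presumably look like this:
Plain Text
def get_service_context():
    llm_predictor_chatgpt = get_llm_predictor()
    return ServiceContext.from_defaults(
        llm_predictor=llm_predictor_chatgpt,
        chunk_size=512,  # splitting granularity; independent of max_tokens
    )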
How do you control the metadata? That's the issue. If I could add text to the metadata, boom, solved.