VectorStoreIndex.from_documents does not insert the text metadata into Pinecone.

Here is my ingestion code:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex, download_loader


def import_file(dir: str):
    UnstructuredReader = download_loader('UnstructuredReader')
    PDFReader = download_loader("PDFReader")
    dir = "/Users/abv/workspace/ranch-bot/source_files/txt"
    dir_reader = SimpleDirectoryReader(dir, file_extractor={
        ".txt": UnstructuredReader(),
        ".pdf": UnstructuredReader()
    })
    documents = dir_reader.load_data()
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=get_pinecone_storage_context(),
        service_context=get_service_context()
    )
    return index
```
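get_pinecone_storage_context is not shown here; a minimal sketch of what such a helper might look like with llama-index 0.9.x and pinecone-client 2.x (the index name and environment variables are placeholders, not from the original post):

```python
import os

import pinecone
from llama_index import StorageContext
from llama_index.vector_stores import PineconeVectorStore


def get_pinecone_storage_context() -> StorageContext:
    # Connect to an existing Pinecone index (names/keys are placeholders).
    pinecone.init(
        api_key=os.environ["PINECONE_API_KEY"],
        environment=os.environ["PINECONE_ENVIRONMENT"],
    )
    pinecone_index = pinecone.Index("ranch-bot")
    vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
    return StorageContext.from_defaults(vector_store=vector_store)
```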
My environment:

```
$ pip freeze
aiohttp==3.9.1 aiosignal==1.3.1 aiostream==0.5.2 anyio==3.7.1 appnope==0.1.3 argcomplete==3.2.1
asttokens==2.4.1 async-timeout==4.0.3 attrs==23.1.0 backoff==2.2.1 beautifulsoup4==4.12.2 certifi==2023.11.17
chardet==5.2.0 charset-normalizer==3.3.2 click==8.1.7 comm==0.2.0 dataclasses-json==0.6.3 debugpy==1.8.0
decorator==5.1.1 Deprecated==1.2.14 distro==1.8.0 dnspython==2.4.2 docx2txt==0.8 EbookLib==0.18
emoji==2.9.0 exceptiongroup==1.2.0 executing==2.0.1 fastapi==0.98.0 fastapi-sessions==0.3.2 filelock==3.13.1
filetype==1.2.0 frozenlist==1.4.1 fsspec==2023.12.2 greenlet==3.0.2 h11==0.14.0 httpcore==1.0.2
httpx==0.25.2 huggingface-hub==0.19.4 idna==3.6 importlib-metadata==7.0.0 ipykernel==6.27.1 ipython==8.18.1
itsdangerous==2.1.2 jedi==0.19.1 joblib==1.3.2 jsonpatch==1.33 jsonpath-python==1.0.6 jsonpointer==2.4
jupyter_client==8.6.0 jupyter_core==5.5.1 langchain==0.0.340 langdetect==1.0.9 langsmith==0.0.71 llama-index==0.9.7
loguru==0.7.2 lxml==4.9.3 marshmallow==3.20.1 matplotlib-inline==0.1.6 multidict==6.0.4 mypy-extensions==1.0.0
nest-asyncio==1.5.8 nltk==3.8.1 numpy==1.26.2 openai==1.3.5 packaging==23.2 pandas==2.1.4
parso==0.8.3 pexpect==4.9.0 Pillow==10.1.0 pinecone-client==2.2.4 platformdirs==4.1.0 prompt-toolkit==3.0.43
psutil==5.9.7 ptyprocess==0.7.0 pure-eval==0.2.2 pydantic==1.10.13 Pygments==2.17.2 python-dateutil==2.8.2
python-dotenv==1.0.0 python-iso639==2023.12.11 python-magic==0.4.27 python-pptx==0.6.23 pytz==2023.3.post1 PyYAML==6.0.1
pyzmq==25.1.2 rapidfuzz==3.5.2 regex==2023.10.3 requests==2.31.0 safetensors==0.4.1 six==1.16.0
sniffio==1.3.0 soupsieve==2.5 SpeechRecognition==3.10.1 SQLAlchemy==2.0.23 stack-data==0.6.3 starlette==0.27.0
tabulate==0.9.0 tenacity==8.2.3 termcolor==2.3.0 textract==1.5.0 tiktoken==0.5.2 tokenizers==0.15.0
tornado==6.4 tqdm==4.66.1 traitlets==5.14.0 transformers==4.36.2 typing-inspect==0.9.0 typing_extensions==4.9.0
tzdata==2023.3 unstructured==0.11.5 unstructured-client==0.15.0 urllib3==2.1.0 uvicorn==0.22.0 wcwidth==0.2.12
wrapt==1.16.0 xlrd==2.0.1 XlsxWriter==3.1.9 yarl==1.9.4 zipp==3.17.0
```
When I look at the `document` objects that came from the loader, the text is there, but it does not show up in the metadata.

Note that the usable chunk size is effectively chunk_size - len(metadata), since the metadata is included when sending text to the LLM/embedding model.
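A quick way to see what actually gets sent to the embedding model versus the raw text is MetadataMode (a small sketch against llama-index 0.9.x; `documents` is the list loaded above):

```python
from llama_index.schema import MetadataMode

doc = documents[0]
# Text as the embedding model sees it (metadata prepended).
print(doc.get_content(metadata_mode=MetadataMode.EMBED))
# Raw text only, with no metadata prepended.
print(doc.get_content(metadata_mode=MetadataMode.NONE))
```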
The node text itself is stored in Pinecone under the `_node_content` metadata field -- which is also used to reconstruct the node when fetching.
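To confirm the text really landed in Pinecone, you can fetch a vector directly and decode `_node_content` (a sketch assuming pinecone-client 2.x; the index name and vector id are placeholders):

```python
import json

import pinecone

pinecone.init(api_key="...", environment="...")  # placeholders
index = pinecone.Index("ranch-bot")

# Fetch one vector by id (take an id from a query result or your logs).
res = index.fetch(ids=["<some-vector-id>"])
vec = list(res.vectors.values())[0]

# The serialized node (including its text) lives under _node_content.
node_content = json.loads(vec.metadata["_node_content"])
print(node_content.get("text", "")[:200])
```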
(For comparison, LangChain's Pinecone vector store exposes a `text_key` setting for this instead: https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.pinecone.Pinecone.html?highlight=pinecone)

To control how documents are chunked, you can set the chunk size on the service context:

```python
service_context = ServiceContext.from_defaults(..., chunk_size=512)
```
Or pass a text splitter explicitly:

```python
from llama_index.text_splitter import TokenTextSplitter

service_context = ServiceContext.from_defaults(..., text_splitter=TokenTextSplitter(chunk_size=512))
```
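To check what the chunking produces before anything is pushed to Pinecone, you can run the service context's node parser over the documents (a small sketch; `documents` and `service_context` are the ones from above):

```python
# Split the loaded documents into nodes the same way the index would.
nodes = service_context.node_parser.get_nodes_from_documents(documents)

for node in nodes[:3]:
    # Node text length plus its metadata; the metadata length eats into chunk_size.
    print(len(node.text), node.metadata)
```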
For context, get_service_context in the code above is defined as:

```python
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext


def get_llm_predictor():
    return LLMPredictor(llm=ChatOpenAI(temperature=0, max_tokens=512, model_name=CHAT_MODEL))


def get_service_context():
    llm_predictor_chatgpt = get_llm_predictor()
    return ServiceContext.from_defaults(llm_predictor=llm_predictor_chatgpt)
```
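If you go the chunk_size route, one way to apply it to this helper as written (a sketch, not the only option) would be:

```python
def get_service_context():
    llm_predictor_chatgpt = get_llm_predictor()
    # Smaller chunks; remember the effective size is reduced by the metadata length.
    return ServiceContext.from_defaults(
        llm_predictor=llm_predictor_chatgpt,
        chunk_size=512,
    )
```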