Find answers from the community

Updated 11 months ago

I have a document of 700 pages

At a glance

The community member has a 700-page document and is trying to create embeddings for it and upload them to a vector store (Datastax Astra), but is getting an error due to the document size limitation. The community members are discussing ways to split the document, such as using a text splitter or the Ingestion pipeline in Llama Index TS. One community member suggests creating nodes and adding metadata to the documents before uploading them to the vector store. The community members also mention a Discord channel for Llama Index TS where the community member can post their query.

I have a document of 700 pages. The issue is that when I try to create embeddings for that file and upload them to a vector store (in my case Datastax Astra), I get this error:
"error": "Error: Command "insertMany" failed with the following errors: [{"message":"Document size limitation violated: indexed String value (property 'content') length (17900 bytes) exceeds maximum allowed (8000 bytes)","errorCode":"SHREDDOC_LIMIT_VIOLATION"}]"
Is there any way I can split each document? Please help, I'm new to LlamaIndex.TS.
I guess I have to use a text splitter, or what?
6 comments
Yes, you can use a TextSplitter to split the text into the required size.
Are you creating Nodes yourself?
By default it should break the document into a limited size, IMO.


Do try this and see if it solves your issue.
This is how I'm creating documents. Can you help me add a sentence splitter to it?
Plain Text
try {
      const dataPath = path.resolve('../documents');
      const reader = new SimpleDirectoryReader();
      const documents = await reader.loadData({ directoryPath: dataPath });
      const documentsWithMetadata = [];
      
      //uploading files to firebase storage and collection
      const uploadedFiles = await uploadFiles();
      const filesInfo = uploadedFiles.uploadedFilesInfo;
      
      documents.forEach((doc) => {
        const fileName = path.basename(doc.id_);
        const fileObj = filesInfo.find((info) => info.hasOwnProperty(fileName));
        const fileId = fileObj ? fileObj[fileName].id : null;
        
        // Create a Document instance with metadata
        const documentWithMetadata = new Document({
            // embedding:5,
            text: doc.text,
            metadata: {
                filename: fileName,
                fileId: fileId
            }
        });
        documentsWithMetadata.push(documentWithMetadata);
      });
  
      //connection with AstraDB
      const astraVS = new AstraDBVectorStore();
      await astraVS.create(collectionName, {
          vector: { dimension: 1536, metric: "cosine" },
      });
      await astraVS.connect(collectionName);
      const ctx = await storageContextFromDefaults({ vectorStore: astraVS });
      await VectorStoreIndex.fromDocuments(documentsWithMetadata, {
          storageContext: ctx,
      });
    } catch (err) {
      // The original snippet was cut off here; a catch is added to close the try block.
      console.error(err);
    }
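To see why Astra rejects the upload, note that the error fires whenever one chunk's `content` string exceeds 8000 bytes (the limit quoted in the error message above). Pre-splitting each document's text before indexing avoids it. Here is a minimal, library-free sketch of such a chunker; `chunkByBytes` is a hypothetical helper, not a LlamaIndex API, and the sentence-boundary regex is a naive assumption:

```typescript
// Split a long string into chunks whose UTF-8 size stays under a byte limit.
// Hypothetical helper -- illustrates the kind of pre-splitting a real
// SentenceSplitter performs; not part of LlamaIndex.TS.
function chunkByBytes(text: string, maxBytes: number = 8000): string[] {
  const encoder = new TextEncoder();
  // Naive sentence boundaries: split after ., !, or ? followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? current + " " + sentence : sentence;
    if (encoder.encode(candidate).length > maxBytes && current) {
      // Adding this sentence would blow the budget: flush and start fresh.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

One caveat: a single sentence longer than `maxBytes` would still come out oversized here; a real splitter also hard-splits such sentences, which this sketch omits for brevity.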
One thing,

Plain Text
documents.forEach((doc) => {
        const fileName = path.basename(doc.id_);
        const fileObj = filesInfo.find((info) => info.hasOwnProperty(fileName));
        const fileId = fileObj ? fileObj[fileName].id : null;
        
        // Attach metadata directly to the existing document
        doc.metadata = { /* ADD your dict here */ };
      });

This should also work.


For the error part, Use Ingestion pipeline to chunk it into nodes before passing it into VectorStoreIndex: https://ts.llamaindex.ai/modules/ingestion_pipeline/
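
As a rough idea of what such a pipeline does, here is a plain-TypeScript sketch: documents flow through an ordered list of transformations that produce nodes. The names `runPipeline` and `splitAt` are made up for illustration; in the real library this role is played by `IngestionPipeline` and its transformations, described at the link above.

```typescript
// Minimal stand-in types to illustrate the pipeline idea; the real
// IngestionPipeline from llamaindex replaces all of this.
type Node = { text: string; metadata: Record<string, string> };
type Transform = (nodes: Node[]) => Node[];

// Run documents through an ordered list of transformations.
function runPipeline(docs: Node[], transforms: Transform[]): Node[] {
  return transforms.reduce((nodes, t) => t(nodes), docs);
}

// Example transform: split each node's text into fixed-size pieces,
// copying metadata onto every resulting node (a crude splitter stand-in).
const splitAt = (size: number): Transform => (nodes) =>
  nodes.flatMap((n) => {
    const out: Node[] = [];
    for (let i = 0; i < n.text.length; i += size) {
      out.push({ text: n.text.slice(i, i + size), metadata: { ...n.metadata } });
    }
    return out;
  });
```

The point is that chunking happens once, up front, so every node handed to `VectorStoreIndex` is already under the store's size limit and still carries the original file's metadata.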

Also for TS specific, we have a channel: https://discord.com/channels/1059199217496772688/1133167189860565033

You can post the query there as well
Thank you so much. One question: can you tell me what is meant by "dict" below?
doc.metadata = { ADD your dict here }
My TS is not so good lol πŸ˜† hence proved

I was referring this:
Plain Text
doc.metadata = {
                filename: fileName,
                fileId: fileId
            }
Will this create nodes? Does that mean the current code isn't creating any nodes?