I have a document of 700 pages

I have a document of 700 pages. The issue is that when I try to create embeddings for that file and upload them to a vector store (in my case, DataStax Astra), I get this error:
Plain Text
"error": "Error: Command "insertMany" failed with the following errors: [{"message":"Document size limitation violated: indexed String value (property 'content') length (17900 bytes) exceeds maximum allowed (8000 bytes)","errorCode":"SHREDDOC_LIMIT_VIOLATION"}]"
Is there any way I can split each document? Please help, I'm new to LlamaIndex TS.
I guess I have to use a text splitter or something?
Yes, you can use a text splitter to split the text into the required size.
Are you creating nodes yourself?
By default it should break documents into limited-size chunks, IMO.
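For example, something like this (a rough sketch; longDocumentText is a stand-in for your own text, and option names like chunkSize/chunkOverlap may differ between llamaindex versions, so check the docs):

Plain Text
import { Document, SentenceSplitter } from "llamaindex";

// Split one long string into sentence-aware chunks so each chunk
// stays well under Astra's 8000-byte indexed-string limit
const splitter = new SentenceSplitter({ chunkSize: 512, chunkOverlap: 20 });
const chunks = splitter.splitText(longDocumentText);

// Wrap each chunk in its own Document before indexing
const splitDocs = chunks.map((chunk) => new Document({ text: chunk }));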


Do try this and see if it solves your issue.
This is how I'm creating documents; can you help me add a sentence splitter to it?
Plain Text
import path from "path";
import {
  SimpleDirectoryReader,
  Document,
  AstraDBVectorStore,
  storageContextFromDefaults,
  VectorStoreIndex,
} from "llamaindex";

try {
  const dataPath = path.resolve("../documents");
  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData({ directoryPath: dataPath });
  const documentsWithMetadata = [];

  // upload files to Firebase storage and collection
  const uploadedFiles = await uploadFiles();
  const filesInfo = uploadedFiles.uploadedFilesInfo;

  documents.forEach((doc) => {
    const fileName = path.basename(doc.id_);
    const fileObj = filesInfo.find((info) => info.hasOwnProperty(fileName));
    const fileId = fileObj ? fileObj[fileName].id : null;

    // Create a Document instance with metadata
    const documentWithMetadata = new Document({
      text: doc.text,
      metadata: {
        filename: fileName,
        fileId: fileId,
      },
    });
    documentsWithMetadata.push(documentWithMetadata);
  });

  // connect to AstraDB
  const astraVS = new AstraDBVectorStore();
  await astraVS.create(collectionName, {
    vector: { dimension: 1536, metric: "cosine" },
  });
  await astraVS.connect(collectionName);
  const ctx = await storageContextFromDefaults({ vectorStore: astraVS });
  await VectorStoreIndex.fromDocuments(documentsWithMetadata, {
    storageContext: ctx,
  });
} catch (err) {
  console.error(err);
}
One thing,

Plain Text
documents.forEach((doc) => {
  const fileName = path.basename(doc.id_);
  const fileObj = filesInfo.find((info) => info.hasOwnProperty(fileName));
  const fileId = fileObj ? fileObj[fileName].id : null;

  // Set your metadata dict directly on the existing Document
  doc.metadata = { /* ADD your dict here */ };
});

This should also work.


For the error part, use an IngestionPipeline to chunk the documents into nodes before passing them into VectorStoreIndex: https://ts.llamaindex.ai/modules/ingestion_pipeline/
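Roughly like this (a minimal sketch based on that page, reusing documentsWithMetadata and ctx from your snippet above; I'm assuming your llamaindex version lets you pass pre-chunked nodes to VectorStoreIndex.init, so double-check against the docs):

Plain Text
import { IngestionPipeline, SentenceSplitter, VectorStoreIndex } from "llamaindex";

// Run the documents through a splitter so every node stays under the size limit
const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter({ chunkSize: 512, chunkOverlap: 20 })],
});
const nodes = await pipeline.run({ documents: documentsWithMetadata });

// Index the pre-chunked nodes instead of the raw 700-page documents
const index = await VectorStoreIndex.init({ nodes, storageContext: ctx });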

Also, for TS-specific questions we have a channel: https://discord.com/channels/1059199217496772688/1133167189860565033

You can post the query there as well.
Thank you so much! One question: can you tell me what is meant by the "dict" below?
doc.metadata = { /* ADD your dict here */ }
My TS is not so good lol 😆 hence proved

I was referring to this:
Plain Text
doc.metadata = {
  filename: fileName,
  fileId: fileId
};
Will this thing create nodes? Does this mean that right now this code is not creating any nodes?