Find answers from the community

Updated 11 months ago

I have a document of 700 pages

At a glance

The community member has a 700-page document and is trying to create embeddings for it and upload them to a vector store (Datastax Astra), but is getting an error due to the document size limitation. The community members are discussing ways to split the document, such as using a text splitter or the Ingestion pipeline in Llama Index TS. One community member suggests creating nodes and adding metadata to the documents before uploading them to the vector store. The community members also mention a Discord channel for Llama Index TS where the community member can post their query.

I have a document of 700 pages. The issue is that when I try to create embeddings for that file and upload them to a vector store (in my case Datastax Astra), I get this error:
"error": "Error: Command "insertMany" failed with the following errors: [{"message":"Document size limitation violated: indexed String value (property 'content') length (17900 bytes) exceeds maximum allowed (8000 bytes)","errorCode":"SHREDDOC_LIMIT_VIOLATION"}]"
Is there any way I can split each document? Please help, I'm new to LlamaIndex.TS.
I guess I have to use a text splitter, or what?
6 comments
Yes, you can use a TextSplitter to split the text into the required size.
Are you creating Nodes yourself?
By default it should break the document into a limited size, IMO.


Do try this and see if it solves your issue.
This is how I'm creating documents. Can you help me add a sentence splitter to it?
Plain Text
try {
      const dataPath = path.resolve('../documents');
      const reader = new SimpleDirectoryReader();
      const documents = await reader.loadData({ directoryPath: dataPath });
      const documentsWithMetadata = [];
      
      //uploading files to firebase storage and collection
      const uploadedFiles = await uploadFiles();
      const filesInfo = uploadedFiles.uploadedFilesInfo;
      
      documents.forEach((doc) => {
        const fileName = path.basename(doc.id_);
        const fileObj = filesInfo.find((info) => info.hasOwnProperty(fileName));
        const fileId = fileObj ? fileObj[fileName].id : null;
        
        // Create a Document instance with metadata
        const documentWithMetadata = new Document({
            // embedding:5,
            text: doc.text,
            metadata: {
                filename: fileName,
                fileId: fileId
            }
        });
        documentsWithMetadata.push(documentWithMetadata);
      });
  
      //connection with AstraDB
      const astraVS = new AstraDBVectorStore();
      await astraVS.create(collectionName, {
          vector: { dimension: 1536, metric: "cosine" },
      });
      await astraVS.connect(collectionName);
      const ctx = await storageContextFromDefaults({ vectorStore: astraVS });
      await VectorStoreIndex.fromDocuments(documentsWithMetadata, {
          storageContext: ctx,
      });
    } catch (err) {
      // The original snippet was cut off here; a catch is added to close the try block.
      console.error(err);
    }
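To see why Astra rejects the upload, note that the error fires whenever one chunk's `content` string exceeds 8000 bytes (the limit quoted in the error message above). Pre-splitting each document's text before indexing avoids it. Here is a minimal, library-free sketch of such a chunker; `chunkByBytes` is a hypothetical helper, not a LlamaIndex API, and the sentence-boundary regex is a naive assumption:

```typescript
// Split a long string into chunks whose UTF-8 size stays under a byte limit.
// Hypothetical helper -- illustrates the kind of pre-splitting a real
// SentenceSplitter performs; not part of LlamaIndex.TS.
function chunkByBytes(text: string, maxBytes: number = 8000): string[] {
  const encoder = new TextEncoder();
  // Naive sentence boundaries: split after ., !, or ? followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? current + " " + sentence : sentence;
    if (encoder.encode(candidate).length > maxBytes && current) {
      // Adding this sentence would blow the budget: flush and start fresh.
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

One caveat: a single sentence longer than `maxBytes` would still come out oversized here; a real splitter also hard-splits such sentences, which this sketch omits for brevity.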
One thing,

Plain Text
documents.forEach((doc) => {
        const fileName = path.basename(doc.id_);
        const fileObj = filesInfo.find((info) => info.hasOwnProperty(fileName));
        const fileId = fileObj ? fileObj[fileName].id : null;
        
        // Attach metadata directly to the existing document
        doc.metadata = { /* ADD your dict here */ };
      });

This should also work.


For the error part, Use Ingestion pipeline to chunk it into nodes before passing it into VectorStoreIndex: https://ts.llamaindex.ai/modules/ingestion_pipeline/
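
As a rough idea of what such a pipeline does, here is a plain-TypeScript sketch: documents flow through an ordered list of transformations that produce nodes. The names `runPipeline` and `splitAt` are made up for illustration; in the real library this role is played by `IngestionPipeline` and its transformations, described at the link above.

```typescript
// Minimal stand-in types to illustrate the pipeline idea; the real
// IngestionPipeline from llamaindex replaces all of this.
type Node = { text: string; metadata: Record<string, string> };
type Transform = (nodes: Node[]) => Node[];

// Run documents through an ordered list of transformations.
function runPipeline(docs: Node[], transforms: Transform[]): Node[] {
  return transforms.reduce((nodes, t) => t(nodes), docs);
}

// Example transform: split each node's text into fixed-size pieces,
// copying metadata onto every resulting node (a crude splitter stand-in).
const splitAt = (size: number): Transform => (nodes) =>
  nodes.flatMap((n) => {
    const out: Node[] = [];
    for (let i = 0; i < n.text.length; i += size) {
      out.push({ text: n.text.slice(i, i + size), metadata: { ...n.metadata } });
    }
    return out;
  });
```

The point is that chunking happens once, up front, so every node handed to `VectorStoreIndex` is already under the store's size limit and still carries the original file's metadata.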

Also for TS specific, we have a channel: https://discord.com/channels/1059199217496772688/1133167189860565033

You can post the query there as well
Thank you so much. One question: can you tell me what is meant by "dict" below?
doc.metadata = { ADD your dict here }
My TS is not so good lol πŸ˜† hence proved

I was referring this:
Plain Text
doc.metadata = {
                filename: fileName,
                fileId: fileId
            }
Will this create nodes? Does that mean the current code isn't creating any nodes?