Updated 5 months ago

im getting a parse error (used llamaparse and now get_nodes_from_documents, error tokenizi

@kapa.ai I'm getting a parse error (used LlamaParse and now get_nodes_from_documents): Error tokenizing data. C error: EOF inside string starting at row 0.
Please stand by @akvn as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
I'm sorry to hear that you're encountering a parsing error. The message "Error tokenizing data. C error: EOF inside string" is raised by pandas' CSV parser when a quoted field is opened but never closed, so it typically points to malformed tabular content in the input rather than to the call itself. It's possible that there's an issue with the documents you're passing to the get_nodes_from_documents method.

Here's a general example of how to use get_nodes_from_documents:

```python
from llama_index.core.node_parser import SimpleNodeParser

# Split each document into nodes with a target chunk size of 1024 tokens.
node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)
nodes = node_parser.get_nodes_from_documents(docs)
```

(Source: LlamaIndex Examples)

In this example, docs is a list of Document objects. Each Document object contains the text to be parsed and optional metadata.

If you're still encountering the error, it would be helpful to check the content of your documents. Make sure that the text doesn't contain any unexpected characters or formatting that might be causing the parser to fail.
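
One way to do that check is a small scan over the document text. The sketch below is only a heuristic and rests on assumptions: that the failure comes from a markdown table in the parsed output, and that rows with an odd number of double quotes or a missing closing pipe are the likely culprits. The function name `find_suspicious_table_rows` is hypothetical and not part of LlamaIndex.

```python
def find_suspicious_table_rows(text):
    """Return (line_number, reason) pairs for markdown table rows that look malformed.

    Heuristic only: flags rows that start with "|" and either contain an odd
    number of double quotes or do not end with a closing "|".
    """
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        row = line.strip()
        # Only inspect lines that look like markdown table rows.
        if not row.startswith("|"):
            continue
        if row.count('"') % 2 == 1:
            findings.append((number, "odd number of double quotes"))
        elif not row.endswith("|"):
            findings.append((number, "missing trailing pipe"))
    return findings


sample = '| a | b |\n| 1 | say "hi |\nplain "text\n| x | y'
for line_number, reason in find_suspicious_table_rows(sample):
    print(line_number, reason)
```

With real data you could run it over the text of each document, for example `find_suspicious_table_rows(doc.text)` for every `doc` in `docs`, and inspect the reported lines by hand.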

If the issue persists, I would recommend reaching out to LlamaIndex support or checking the LlamaIndex documentation for more detail on how to use the parsing functions.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
but the data im using is documents which is returned by llamaparse
but the data im using is the documents which is returned by llamaparse @kapa.ai
Let me check the knowledge base for you. One moment! :books:
----------------------
I see, you're using LlamaParse to load your data and then passing the returned documents to get_nodes_from_documents. Here's a general example of how you might do this:

```python
from llama_parse import LlamaParse
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

# Load data using LlamaParse (returns a list of Document objects)
parser = LlamaParse(result_type="markdown")
documents = parser.load_data(file_path="./data/your_file.pdf")

# Parse nodes from documents; tables and text are handled as separate elements
md_node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4-turbo-preview"),
    num_workers=3,
    include_metadata=True,
)
nodes = md_node_parser.get_nodes_from_documents(documents)
```

(Source: LlamaIndex GitHub)

If you're still encountering the error, it's possible that there's an issue with the specific documents you're trying to parse. The error message suggests that there might be a problem with the format of the data in the documents.

You might want to check the content of the documents returned by LlamaParse to see if there's anything unusual that might be causing the issue. If the problem persists, I would recommend reaching out to LlamaIndex support for further assistance.
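
To narrow down which document triggers the error, one option is to parse the documents one at a time and record the ones that raise. This is a minimal sketch, not an official LlamaIndex utility: `find_failing_documents` is a hypothetical helper, and it only assumes the parser object exposes `get_nodes_from_documents`, as in the example above. Note that with an LLM-backed parser, every call may trigger API requests.

```python
def find_failing_documents(node_parser, documents):
    """Parse each document on its own and collect the ones that raise.

    Returns a list of (index, error_description) pairs.
    """
    failures = []
    for index, document in enumerate(documents):
        try:
            node_parser.get_nodes_from_documents([document])
        except Exception as error:
            # Broad catch on purpose: the goal is only to locate the failure.
            failures.append((index, repr(error)))
    return failures
```

A call such as `find_failing_documents(md_node_parser, documents)` would then give you the indices of the documents worth inspecting.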
----------------------
what other parsers can i use? @kapa.ai
Let me check the knowledge base for you. One moment! :books:
----------------------
LlamaIndex provides several parsers that you can use to parse different types of content. Here are some of them:

  1. SimpleFileNodeParser: This parser can be used to automatically select the best node parser for each type of content. Here's an example of how to use it:

    ```python
    from pathlib import Path

    from llama_index.core.node_parser import SimpleFileNodeParser
    from llama_index.readers.file import FlatReader

    md_docs = FlatReader().load_data(Path("./test.md"))

    parser = SimpleFileNodeParser()
    md_nodes = parser.get_nodes_from_documents(md_docs)
    ```

(Source: LlamaIndex Documentation)

  2. HTMLNodeParser: This parser uses beautifulsoup to parse raw HTML.

    ```python
    from llama_index.core.node_parser import HTMLNodeParser

    parser = HTMLNodeParser(tags=["p", "h1"])  # optional list of tags
    nodes = parser.get_nodes_from_documents(html_docs)
    ```

(Source: LlamaIndex Documentation)

  3. JSONNodeParser: This parser is used to parse raw JSON.

    ```python
    from llama_index.core.node_parser import JSONNodeParser

    parser = JSONNodeParser()
    nodes = parser.get_nodes_from_documents(json_docs)
    ```

(Source: LlamaIndex Documentation)

  4. MarkdownNodeParser: This parser is used to parse raw markdown text.
```python
from llama_index.core.node_parser import MarkdownNodeParser
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
```
(Source: LlamaIndex Documentation)

Remember to replace the example file paths and variable names with your actual data.
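
If one parser fails on your documents, a simple pattern is to try several parsers in order and use the first one that succeeds. The sketch below is an illustration under assumptions, not a LlamaIndex feature: `parse_with_fallback` is a hypothetical helper that only relies on each parser exposing `get_nodes_from_documents`, as the parsers listed above do.

```python
def parse_with_fallback(node_parsers, documents):
    """Try each node parser in order and return the nodes from the first that succeeds."""
    last_error = None
    for node_parser in node_parsers:
        try:
            return node_parser.get_nodes_from_documents(documents)
        except Exception as error:
            # Remember the failure and move on to the next parser.
            last_error = error
    raise RuntimeError("all node parsers failed") from last_error
```

For instance, you might pass a list with a table-aware parser first and `MarkdownNodeParser()` second, so that documents with malformed tables still get split into nodes.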
----------------------