im getting a parse error (used llamaparse and now get

At a glance

A community member is encountering a parsing error when using the get_nodes_from_documents method from the LlamaIndex library. The error message suggests a problem with the format of the data being parsed. The community members provide some general examples of how to use the get_nodes_from_documents method and suggest checking the content of the documents being passed to it. They also mention that the community member is using the LlamaParse library to load the data, and provide an example of how to integrate LlamaParse with LlamaIndex. Additionally, the community members list several other parsers available in LlamaIndex, such as SimpleFileNodeParser, HTMLNodeParser, JSONNodeParser, and MarkdownNodeParser, which can be used to parse different types of content.

akvn

@kapa.ai I'm getting a parse error (used LlamaParse and now get_nodes_from_documents): Error tokenizing data. C error: EOF inside string starting at row 0.

10 comments

kapa.ai

Please stand by @akvn as I search the knowledge base. I normally take about 20 seconds or so. :wave:

kapa.ai

----------------------
I'm sorry to hear that you're encountering a parsing error. The error message you're seeing typically occurs when there's a problem with the format of the data you're trying to parse. It's possible that there's an issue with the documents you're passing to the get_nodes_from_documents method.

Here's a general example of how to use get_nodes_from_documents:

    from llama_index.core.node_parser import SimpleNodeParser

    # docs is a list of Document objects, for example the output of a reader
    node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)
    nodes = node_parser.get_nodes_from_documents(docs)

(Source: LlamaIndex Examples)

In this example, docs is a list of Document objects. Each Document object contains the text to be parsed and optional metadata.

If you're still encountering the error, it would be helpful to check the content of your documents. Make sure that the text doesn't contain any unexpected characters or formatting that might be causing the parser to fail.
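One concrete thing to look for: the "EOF inside string" wording matches what pandas' CSV reader reports when it reaches the end of its input while still inside a quoted field, so a stray double quote in the parsed text is a plausible trigger. The sketch below is only a rough check under that assumption. `find_unbalanced_quote_lines` is a hypothetical helper, not part of LlamaIndex, and it works on plain strings (for example the `text` of each document).

```python
from typing import List, Tuple


def find_unbalanced_quote_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs whose count of double quotes is odd."""
    suspicious = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if line.count('"') % 2 == 1:
            suspicious.append((line_number, line))
    return suspicious


# Hypothetical sample standing in for the text of one parsed document.
sample_text = '| name | size |\n| 12" monitor | large |\n| "quoted" | small |'
suspicious_lines = find_unbalanced_quote_lines(sample_text)
for line_number, line in suspicious_lines:
    print(f"line {line_number}: {line}")
```

An odd quote count is only a heuristic: it can flag harmless lines (inch marks, for instance) and it will miss other causes, so treat the output as a list of places to read by eye.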

If the issue persists, I would recommend reaching out to the LlamaIndex support or checking their documentation for more detailed information on how to use their parsing functions.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------

akvn

but the data im using is documents which is returned by llamaparse

akvn

but the data im using is the documents which is returned by llamaparse @kapa.ai

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
I see, you're using LlamaParse to load your data and then passing the returned documents to get_nodes_from_documents. Here's a general example of how you might do this:

    from llama_parse import LlamaParse
    from llama_index.core.node_parser import MarkdownElementNodeParser
    from llama_index.llms.openai import OpenAI

    # Load data using LlamaParse; result_type="markdown" returns Markdown text
    parser = LlamaParse(result_type="markdown")
    documents = parser.load_data(file_path="./data/your_file.pdf")

    # Parse nodes from documents
    md_node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model="gpt-4-turbo-preview"),  # use any chat model your account can access
        num_workers=3,
        include_metadata=True,
    )
    nodes = md_node_parser.get_nodes_from_documents(documents)

(Source: LlamaIndex GitHub)

If you're still encountering the error, it's possible that there's an issue with the specific documents you're trying to parse. The error message suggests that there might be a problem with the format of the data in the documents.

You might want to check the content of the documents returned by LlamaParse to see if there's anything unusual that might be causing the issue. If the problem persists, I would recommend reaching out to the LlamaIndex support for further assistance.
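To narrow down which document is responsible, one option is to parse the documents one at a time and record which ones raise. The sketch below illustrates that idea with a stand-in parser so it runs on its own. `split_by_parse_result` and `fake_parse` are hypothetical names, and passing something like `lambda doc: md_node_parser.get_nodes_from_documents([doc])` as `parse_one` is an assumption about how you would wire it to LlamaIndex.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def split_by_parse_result(
    documents: Sequence[Any],
    parse_one: Callable[[Any], List[Any]],
) -> Tuple[List[Any], Dict[int, Exception]]:
    """Parse documents one at a time, collecting nodes and per-document errors."""
    nodes: List[Any] = []
    failures: Dict[int, Exception] = {}
    for index, document in enumerate(documents):
        try:
            nodes.extend(parse_one(document))
        except Exception as error:  # broad on purpose: the goal is to locate the failure
            failures[index] = error
    return nodes, failures


# Stand-in parser for the demo: fails on text with an unbalanced double quote.
def fake_parse(document: str) -> List[str]:
    if document.count('"') % 2 == 1:
        raise ValueError("EOF inside string")
    return document.split()


demo_nodes, demo_failures = split_by_parse_result(
    ["a b", 'broken " text', "c"], fake_parse
)
print(demo_nodes)
print(sorted(demo_failures))
```

The indexes in `demo_failures` point at the documents worth inspecting, while the nodes from the documents that parsed cleanly are still available.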
----------------------

akvn

what other parsers can i use? @kapa.ai

kapa.ai

Let me check the knowledge base for you. One moment! :books:

kapa.ai

----------------------
LlamaIndex provides several parsers that you can use to parse different types of content. Here are some of them:

SimpleFileNodeParser: This parser can be used to automatically select the best node parser for each type of content. Here's an example of how to use it:

    from llama_index.core.node_parser import SimpleFileNodeParser
    from llama_index.readers.file import FlatReader
    from pathlib import Path

    md_docs = FlatReader().load_data(Path("./test.md"))

    parser = SimpleFileNodeParser()
    md_nodes = parser.get_nodes_from_documents(md_docs)

(Source: LlamaIndex Documentation)

HTMLNodeParser: This parser uses beautifulsoup to parse raw HTML.

    from llama_index.core.node_parser import HTMLNodeParser

    parser = HTMLNodeParser(tags=["p", "h1"])  # optional list of tags
    nodes = parser.get_nodes_from_documents(html_docs)

(Source: LlamaIndex Documentation)

JSONNodeParser: This parser is used to parse raw JSON.

    from llama_index.core.node_parser import JSONNodeParser

    parser = JSONNodeParser()
    nodes = parser.get_nodes_from_documents(json_docs)

(Source: LlamaIndex Documentation)

MarkdownNodeParser: This parser is used to parse raw markdown text.

```python
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(markdown_docs)
```
(Source: LlamaIndex Documentation)

Remember to replace the example file paths and variable names with your actual data.
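If one parser fails on your documents, a simple pattern is to try parsers in order and keep the first result that succeeds. The sketch below shows that pattern with stand-in parse functions so it is runnable without LlamaIndex installed. `parse_with_fallback`, `strict_parse`, and `lenient_parse` are hypothetical names; in practice each entry could wrap a parser's `get_nodes_from_documents` method, which is an assumption about your setup.

```python
from typing import Any, Callable, List, Sequence, Tuple

ParseFn = Callable[[Sequence[Any]], List[Any]]


def parse_with_fallback(
    documents: Sequence[Any],
    parsers: Sequence[Tuple[str, ParseFn]],
) -> Tuple[str, List[Any]]:
    """Try each (name, parse function) pair in order; return the first success."""
    errors = []
    for name, parse in parsers:
        try:
            return name, parse(documents)
        except Exception as error:
            errors.append(f"{name}: {error}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


# Stand-in parse functions for the demo.
def strict_parse(docs: Sequence[str]) -> List[str]:
    raise ValueError("EOF inside string")


def lenient_parse(docs: Sequence[str]) -> List[str]:
    return [line for doc in docs for line in doc.splitlines()]


used_parser, fallback_nodes = parse_with_fallback(
    ["# Title\nbody"],
    [("strict", strict_parse), ("lenient", lenient_parse)],
)
print(used_parser, fallback_nodes)
```

Falling back hides the original error, so it is worth logging the collected messages if you need to understand why the first parser failed.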
----------------------