
@kapa.ai what is the difference between TokenSplitter and SentenceSplitter, and which is the best to use?
Please stand by @cmosguy as I search the knowledge base. I normally take about 15 seconds or so. :wave:
----------------------
To answer your question about the difference between TokenTextSplitter and SentenceSplitter in LlamaIndex, here are the key distinctions and recommendations based on the available information:

  1. TokenTextSplitter:
The TokenTextSplitter splits text into chunks based on raw token counts, aiming for a consistent number of tokens per chunk.

```python
from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=1024,
    chunk_overlap=20,
    separator=" ",
)
nodes = splitter.get_nodes_from_documents(documents)
```

  2. SentenceSplitter:
The SentenceSplitter splits text while respecting sentence boundaries, preferring to keep complete sentences and paragraphs together.

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)
```


The main difference is that SentenceSplitter tries to maintain the integrity of sentences and paragraphs, while TokenTextSplitter focuses solely on the token count.

As for which one is best to use, it depends on your specific use case. However, the SentenceSplitter is generally preferred because it maintains the context and coherence of the text better by keeping sentences intact. This can lead to more meaningful chunks for processing by language models.

The LlamaIndex documentation on Node Parser Modules suggests that SentenceSplitter is more commonly used and is often the default choice in many LlamaIndex examples and tutorials.

That being said, if you have specific requirements related to token counts, or if you're working with text that doesn't follow standard sentence structures, the TokenTextSplitter might be more appropriate.
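
If you want to see the difference concretely, you can run both splitters over the same document and compare the chunk boundaries. A minimal sketch (the sample text and the small chunk sizes are arbitrary, chosen only to force several chunks):

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

# Hypothetical sample text; substitute your own.
doc = Document(
    text="LlamaIndex parses documents into nodes. Nodes are chunks of text. "
    "Each node can carry metadata and relationships to other nodes."
)

token_nodes = TokenTextSplitter(chunk_size=16, chunk_overlap=0).get_nodes_from_documents([doc])
sentence_nodes = SentenceSplitter(chunk_size=16, chunk_overlap=0).get_nodes_from_documents([doc])

# TokenTextSplitter chunks may end mid-sentence; SentenceSplitter prefers
# to end each chunk on a sentence boundary whenever possible.
for node in token_nodes:
    print("token:", repr(node.text))
for node in sentence_nodes:
    print("sentence:", repr(node.text))
```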

Remember that LlamaIndex also offers other splitters like CodeSplitter for programming languages and SemanticSplitterNodeParser for more advanced semantic-based splitting. The choice ultimately depends on the nature of your text and your specific requirements.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai how do I use multiple postprocessors with a retriever and query the retriever?
Let me check the knowledge base for you again. One moment! :books:
----------------------
To use multiple postprocessors with a retriever and query it in LlamaIndex, you can follow these steps:

  1. First, set up your retriever:

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever

# Assuming you have already created your index
index = VectorStoreIndex.from_documents(documents)

# Create a retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,  # Adjust as needed
)
```


  2. Define your postprocessors. LlamaIndex offers various postprocessors, such as KeywordNodePostprocessor, SimilarityPostprocessor, and others. Here's an example of setting up multiple postprocessors:

```python
from llama_index.core.postprocessor import KeywordNodePostprocessor, SimilarityPostprocessor

# Create postprocessors
keyword_postprocessor = KeywordNodePostprocessor(
    required_keywords=["your_keyword"],
    exclude_keywords=["exclude_word"],
)

similarity_postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)

# Combine postprocessors into a list
node_postprocessors = [keyword_postprocessor, similarity_postprocessor]
```


  3. Create a query engine that combines the retriever and postprocessors:

```python
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=node_postprocessors,
)
```


  4. Now you can query the retriever with the postprocessors applied:

```python
response = query_engine.query("Your query here")
print(response)
```


This setup will first use the retriever to fetch relevant nodes, then apply the keyword postprocessor to filter nodes based on required and excluded keywords, and finally apply the similarity postprocessor to further filter nodes based on a similarity threshold.

You can customize this setup by adding or removing postprocessors as needed for your specific use case. The order of postprocessors in the list matters, as they are applied sequentially (a rough sketch of that sequence is shown below).
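
As an illustration of that order, the query engine applies the postprocessors roughly like this before response synthesis (a simplified sketch reusing the retriever and node_postprocessors defined in the steps above, not the library's exact internals):

```python
# Fetch candidate nodes, then run each postprocessor over the survivors in turn
nodes = retriever.retrieve("Your query here")
for postprocessor in node_postprocessors:
    nodes = postprocessor.postprocess_nodes(nodes, query_str="Your query here")
# `nodes` is the filtered list that gets handed to the response synthesizer
```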

For more detailed information and advanced configurations, you can refer to the LlamaIndex documentation on Customizing the stages of querying.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai is there a way to make a retriever that automatically executes the node postprocessors and does not go through an LLM?
Let me check the knowledge base for you again. One moment! :books:
----------------------
Yes, there is a way to create a retriever that automatically executes node postprocessors without going through an LLM in LlamaIndex. This can be achieved using the ObjectRetriever class, which allows you to wrap a base retriever and apply node postprocessors automatically.

Here's how you can set this up:

  1. First, import the necessary classes:

```python
from typing import List

from llama_index.core.objects import ObjectRetriever
from llama_index.core.base.base_retriever import BaseRetriever
from llama_index.core.postprocessor import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle
```


  2. Create your base retriever and node postprocessors:

```python
# Assume you have already created your base_retriever
base_retriever: BaseRetriever = ...

# Create your node postprocessors
node_postprocessors: List[BaseNodePostprocessor] = [
    # Add your postprocessors here
]
```


  3. Create an ObjectRetriever that wraps your base retriever and applies the postprocessors:

```python
# object_node_mapping: a BaseObjectNodeMapping that converts retrieved nodes
# back to objects (ObjectRetriever calls its from_node() method); it is
# typically obtained via an ObjectIndex rather than built by hand.
object_node_mapping = ...

retriever = ObjectRetriever(
    retriever=base_retriever,
    object_node_mapping=object_node_mapping,
    node_postprocessors=node_postprocessors,
)
```


  4. Now you can use this retriever to fetch and postprocess nodes without going through an LLM:

```python
query = "Your query here"
retrieved_nodes = retriever.retrieve(query)
```


The retrieved_nodes will be the postprocessed results, converted back to objects by the object-node mapping.

This approach allows you to apply node postprocessors automatically as part of the retrieval process, without involving an LLM. The ObjectRetriever handles the application of postprocessors internally.

For more details on the ObjectRetriever class and its usage, you can refer to the LlamaIndex API documentation on ObjectRetriever.

Remember that while this method allows you to retrieve and postprocess nodes without an LLM, you won't get a synthesized response. If you need to generate a response based on the retrieved nodes, you would still need to use an LLM or another method to process the retrieved information.
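
Alternatively, if you simply want postprocessed NodeWithScore results and don't need object mapping, a small custom retriever achieves the same thing. A minimal sketch (PostprocessingRetriever is a hypothetical name, not a LlamaIndex class):

```python
from typing import List

from llama_index.core.base.base_retriever import BaseRetriever
from llama_index.core.postprocessor import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle


class PostprocessingRetriever(BaseRetriever):
    """Hypothetical wrapper: runs node postprocessors after a base retriever."""

    def __init__(
        self,
        retriever: BaseRetriever,
        node_postprocessors: List[BaseNodePostprocessor],
    ) -> None:
        self._retriever = retriever
        self._node_postprocessors = node_postprocessors
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # Fetch candidates, then apply each postprocessor in list order
        nodes = self._retriever.retrieve(query_bundle)
        for postprocessor in self._node_postprocessors:
            nodes = postprocessor.postprocess_nodes(nodes, query_bundle=query_bundle)
        return nodes


# Usage: retrieve and postprocess with no LLM involved
# retriever = PostprocessingRetriever(base_retriever, node_postprocessors)
# nodes = retriever.retrieve("Your query here")
```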
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai what is an ObjectRetriever and what is it used for?
Let me check the knowledge base for you again. One moment! :books:
----------------------
An ObjectRetriever is a specialized retriever class in LlamaIndex that is designed to work with arbitrary Python objects. It's a versatile tool that can be used for various purposes, particularly when you need to retrieve and work with custom objects rather than just text or nodes. Here are the key points about ObjectRetriever:

  1. Purpose:
    The primary purpose of an ObjectRetriever is to retrieve objects based on a query. It's particularly useful when you want to index and retrieve custom Python objects, such as tools, database schemas, or any other structured data.
  2. Functionality:
    • It wraps a base retriever (like a vector store retriever) and adds the ability to work with objects.
    • It can automatically apply node postprocessors to the retrieved results.
    • It converts between nodes (used by the underlying index) and objects.
  3. Common Use Cases:
    • Tool Retrieval for Agents: One of the most common uses is to retrieve relevant tools for AI agents. This is demonstrated in the OpenAI Agent with Tool Retrieval example.
    • Indexing and Retrieving Custom Objects: It can be used to create an index of any type of Python object and then retrieve those objects based on queries.
  4. How it works:
    • It uses an underlying index (often a VectorStoreIndex) to store and retrieve nodes.
    • It employs an object-node mapping to convert between objects and nodes.
    • When retrieving, it first uses the base retriever to get relevant nodes, then applies any specified node postprocessors, and finally converts the nodes back to objects.
  5. Example Usage:
    Here's a simplified example of how you might use an ObjectRetriever:
```python
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex

# Create an ObjectIndex from your objects
obj_index = ObjectIndex.from_objects(
    your_objects,
    index_cls=VectorStoreIndex,
)

# Create an ObjectRetriever
object_retriever = obj_index.as_retriever(
    similarity_top_k=2,
    node_postprocessors=[YourPostProcessor()],
)

# Retrieve objects
retrieved_objects = object_retriever.retrieve("your query")
```

The ObjectRetriever is particularly powerful because it allows you to work with complex, structured data while still leveraging the retrieval capabilities of LlamaIndex. This makes it an essential tool for building more sophisticated applications, especially those involving AI agents or complex data structures.

For more details, see the LlamaIndex API documentation on ObjectRetriever.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai I have the pipeline:

```python
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(),
        MarkdownElementNodeParser(llm=llm_4o_2, show_progress=verbose),
        embedding_large,
    ],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
```

When I do the document loading I see messages like:
Insert of existing embedding ID: e6d11710-da57-4ae2-a875-c6c5c4105681

what does this UPSERT mean exactly?
Let me check the knowledge base for you again. One moment! :books:
----------------------
Based on the information provided in the knowledge sources, I can explain what UPSERT means in the context of your IngestionPipeline and the DocstoreStrategy.UPSERTS setting.

UPSERT is a combination of "update" and "insert". In the context of LlamaIndex's DocstoreStrategy, UPSERTS means:

  1. The system checks whether a document is already in the document store, based on its ID.
  2. If the document is not in the store, or if the document's hash has changed, it will update the document in the doc store and run the transformations.

Specifically, the LlamaIndex documentation on DocstoreStrategy defines the UPSERTS strategy as:

"Use upserts to handle duplicates. Checks if the a document is already in the doc store based on its id. If it is not, or if the hash of the document is updated, it will update the document in the doc store and run the transformations."

The message you're seeing, "Insert of existing embedding ID: e6d11710-da57-4ae2-a875-c6c5c4105681", indicates that the system is attempting to insert a document with an ID that already exists in the store. This is part of the UPSERT process.

In practice, this means:

  1. If the document with this ID doesn't exist, it will be inserted.
  2. If the document exists but its content has changed (different hash), it will be updated.
  3. If the document exists and hasn't changed, the operation will effectively do nothing.

This strategy is useful for maintaining an up-to-date document store, especially when you're repeatedly ingesting documents that may have been updated since the last ingestion. It ensures that your document store always contains the most recent version of each document without creating duplicates.
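
To see the three cases in action, here is a minimal sketch (using SentenceSplitter and an in-memory docstore purely for illustration; your transformations and vector store would slot in the same way):

```python
from llama_index.core import Document
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

doc = Document(text="original text", doc_id="doc-1")
pipeline.run(documents=[doc])     # case 1: new ID, document is inserted
pipeline.run(documents=[doc])     # case 3: same ID and hash, nothing to do

doc_v2 = Document(text="updated text", doc_id="doc-1")
pipeline.run(documents=[doc_v2])  # case 2: same ID, new hash, document is updated
```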
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai what if I do not set a doc_id when I load the documents?
Let me check the knowledge base for you again. One moment! :books:
----------------------
To answer your question about what happens if you don't set a doc_id when loading documents in LlamaIndex:

If you don't explicitly set a doc_id when loading documents, LlamaIndex will automatically generate a unique identifier for each document. This is part of the default behavior to ensure that each document can be uniquely identified within the system.

Here are some key points to consider:

  1. Automatic ID Generation: When you use a document loader like SimpleDirectoryReader without specifying document IDs, LlamaIndex will generate unique IDs for each document automatically. These IDs are typically UUID-based (see the sketch after this list).
  2. Implications for Upserts: If you're using the DocstoreStrategy.UPSERTS as in your previous pipeline configuration, the behavior will be slightly different:
  • For new documents (first-time ingestion), they will be inserted with their automatically generated IDs.
  • For subsequent ingestions, since the IDs are generated anew each time, the system will treat these as new documents rather than updates to existing ones.
  3. Potential for Duplicates: Without consistent doc_ids across ingestions, you might end up with duplicate content in your index if you ingest the same documents multiple times.
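
As a quick illustration of the automatic IDs (a minimal sketch; the printed value will differ on every run):

```python
from llama_index.core import Document

doc = Document(text="hello")
print(doc.doc_id)  # a freshly generated UUID, e.g. 'e6d11710-...'
```
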
To avoid potential issues with duplicates and to fully leverage the UPSERTS strategy, it's generally recommended to provide consistent and meaningful doc_ids when possible. For example, you can use the filename as the doc_id when loading documents from a directory:

```python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
```

This approach, as shown in the LlamaIndex documentation on document management, ensures that documents have consistent IDs across multiple ingestions, allowing the UPSERTS strategy to work effectively.

If you can't use filenames or don't have a natural unique identifier for your documents, you might need to implement a custom ID generation strategy that remains consistent across ingestions for the same document.

Remember, the choice of whether to set custom doc_ids or rely on automatic generation depends on your specific use case and how you plan to manage document updates and duplicates in your application.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai I am not using SimpleDirectoryReader; I am just loading the documents from text I parse from a previous JSON file
Let me check the knowledge base for you again. One moment! :books:
----------------------
I understand that you're loading documents from text parsed from a previous JSON file, rather than using SimpleDirectoryReader. In this case, you have more control over how you create and manage your documents. Let me provide some guidance on how to handle document IDs in this scenario:

  1. Creating Documents Manually:
When you create documents from parsed text, you can assign custom IDs to ensure consistency across ingestions. Here's an example of how you might do this:

```python
from llama_index.core import Document

documents = []
for i, text in enumerate(parsed_texts):
    doc = Document(
        text=text,
        doc_id=f"custom_id_{i}"  # or any other unique identifier
    )
    documents.append(doc)
```


  2. Using Metadata for Identification:
If your JSON file contains metadata that can uniquely identify each document, you can use that for the doc_id. For example:

```python
for item in json_data:
    doc = Document(
        text=item['text'],
        doc_id=item['unique_identifier'],
        metadata=item.get('metadata', {})
    )
    documents.append(doc)
```


  3. Generating Consistent IDs:
If you don't have natural unique identifiers, you could generate consistent IDs based on the content of the document. For example, using a hash of the content:

```python
import hashlib

for text in parsed_texts:
    doc_id = hashlib.md5(text.encode()).hexdigest()
    doc = Document(text=text, doc_id=doc_id)
    documents.append(doc)
```


By assigning consistent doc_ids, you ensure that the UPSERTS strategy in your IngestionPipeline works effectively. This way, when you ingest the same document again (perhaps with updated content), it will update the existing document rather than creating a duplicate. Note that content-hash IDs, as in option 3, only deduplicate identical content: an updated version of a document gets a new hash and hence a new ID, so it will be ingested as a new document rather than as an update.

Remember, the key is to have a consistent method of generating or assigning doc_ids that remains the same across different ingestion runs for the same logical document. This allows LlamaIndex to properly manage updates and avoid unintended duplicates in your index.

If you need more specific guidance on handling your particular JSON structure or parsing method, please provide more details about your data format and parsing process.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------