Updated last year

At a glance

A community member's results are getting truncated when they use the trafilatura and simple web loaders to embed website contents. The community members discuss potential solutions, such as checking with the Qdrant team, manually chunking the output, and adjusting the chunk size and overlap using a SentenceSplitter. The solution provided by a community member is to set the chunk size and overlap to match the requirement, which avoids the truncation.

Hey guys. I've tried both the trafilatura and simple web loaders to embed website contents, but my results keep getting truncated:

source: web

Doc ID: 1fb9d689-a2fd-4720-9f47-e01ee905b6f9
Text: Next Post Baked Garlic Parmesan Boneless Wings. This post may
contain affiliate links, please see our privacy policy for details.
Nine Favorite Things Happy February! This might be a little cheesy,
but I’ve been looking forward to the start of February for a while
now. I know it’s Valentine’s Day month, and while we don’t all love,
love Valentin...


Does anyone know how I can adjust this so the results don't get truncated?
11 comments
How did you print this output?

trafilatura directly creates a single Document object for each URL.

From there, during indexing, your document gets converted into node objects of a fixed size. Maybe that is where it is getting truncated.
I checked with the Qdrant team and they said there's no truncation on their end.
How did you print the above output?
This was an extract that's getting injected into my prompt, but that's also what's stored in the text element of the TextNode, and that itself is being truncated.
I hacked it to chunk every 512 characters, but that's not ideal.
Yes, that's the default node size.
You can change it like this:
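# Import paths assume a pre-0.10 llama_index release, where ServiceContext is still available
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceSplitter

# chunk_size and chunk_overlap control how each Document is split into nodes
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)
service_context = ServiceContext.from_defaults(text_splitter=text_splitter)

# `documents` is the list returned by your web loader
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)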

Change the value of chunk size as per your requirement
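To get a feel for what the two parameters do, here is a minimal pure-Python sketch of sentence-aware chunking. It is not LlamaIndex's implementation: `split_by_sentences` is a hypothetical helper, it measures sizes in characters (as far as I know, SentenceSplitter counts tokens), and it uses a naive regex to find sentence ends.

```python
import re


def split_by_sentences(text, chunk_size=512, chunk_overlap=10):
    """Greedily pack whole sentences into chunks (hypothetical helper).

    Sizes are in characters, which is an assumption made for simplicity.
    A single sentence longer than chunk_size is kept whole, so a chunk
    can exceed chunk_size in that case.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Naive sentence boundary: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Current chunk is full: store it and start a new one that
            # begins with the last chunk_overlap characters of the old one.
            chunks.append(current)
            tail = current[-chunk_overlap:] if chunk_overlap else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "One. Two. Three. Four."
    for chunk in split_by_sentences(sample, chunk_size=10, chunk_overlap=0):
        print(repr(chunk))
```

A larger `chunk_size` means fewer, longer chunks per page, and a larger `chunk_overlap` repeats more of the previous chunk at the start of the next one.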
oh heck yeah
this is what i was after
i did it manually lol
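For anyone curious what the manual route looks like, a fixed-width character chunker could be sketched roughly like this. `chunk_by_characters` is a hypothetical name, not something from LlamaIndex or trafilatura, and it cuts wherever the window ends, so words and sentences can be split in the middle.

```python
def chunk_by_characters(text, chunk_size=512, chunk_overlap=0):
    """Split text into fixed-width character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text, so no trailing
        # chunk is emitted that only repeats already-covered characters.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    print(chunk_by_characters("abcdefghij", chunk_size=4, chunk_overlap=2))
```

This works, but since it ignores sentence boundaries, the splitter settings above are usually the cleaner option.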