
I have two data files:


ra_2019.txt = 24340 lines
ra_2020.txt = 10759 lines

When I tried to save them to disk using index.save_to_disk('index.json'), the first one, ra_2019.txt, succeeded, while ra_2020.txt failed.

The error for ra_2020.txt is: ValueError: A single term is larger than the allowed chunk size. Term size: 8210, Chunk size: 3566

Any idea how to fix this? I'm not sure why I'm getting this error, since ra_2020.txt is much smaller than ra_2019.txt. Thanks!
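
For context, here is a minimal sketch of the kind of indexing flow that ends in this call, assuming the early gpt_index API that exposed save_to_disk; only index.save_to_disk('index.json') comes from the question, and the directory path is hypothetical:

```python
# Sketch only: assumes the early gpt_index API (GPTSimpleVectorIndex,
# SimpleDirectoryReader); only save_to_disk('index.json') is from the question.
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Load the text files from a local directory (hypothetical path 'data/').
documents = SimpleDirectoryReader('data').load_data()

# Building the index splits the text into chunks; this is where the
# space-based separator mentioned below comes into play.
index = GPTSimpleVectorIndex(documents)
index.save_to_disk('index.json')
```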
By default we use a space-based separator for text. Does ra_2020.txt have large "tokens", where a single "token" is very large?
How can I verify whether ra_2020.txt has large "tokens" like that?

Btw, this is the data inside ra_2020.txt:
https://github.com/ksromero/test/blob/main/ra_2020.txt
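
One way to answer that question (my sketch, not from the thread) is to scan the file for the longest space-separated run of characters and compare it against the chunk size reported in the error:

```python
# Find the longest space-separated "token" in ra_2020.txt (the file
# linked above) and compare it to the chunk size from the error (3566).
with open('ra_2020.txt', encoding='utf-8') as f:
    text = f.read()

longest = max(text.split(' '), key=len)
print(f'longest token: {len(longest)} characters')
print('exceeds chunk size:', len(longest) > 3566)
```

If the longest token comes out larger than the chunk size, that single space-free run is what trips the ValueError.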
Ooh, hm, interesting. Thanks for providing this data; I'll take a look at it soon.
I tried to debug this: I used a different separator, "\n", and it worked.
The data in the txt file came from an HTML table. When I extracted the text with Beautiful Soup, the table text came out like this (the part circled in the screenshot), with no spaces between cells. I think that's why the space-based separator won't work here: since there are no spaces, the splitter just keeps accumulating characters into a single "token", the chunk-size check fails, and the error is thrown.
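
As a side note (my suggestion, not from the thread), the space-less output can also be avoided at the extraction step: BeautifulSoup's get_text() accepts a separator argument, so adjacent table cells come out separated instead of concatenated:

```python
# get_text(separator='\n') inserts a newline between adjacent text
# nodes, so table cells no longer run together; the HTML is a toy input.
from bs4 import BeautifulSoup

html = '<table><tr><td>cell one</td><td>cell two</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text(separator='\n', strip=True))
# cell one
# cell two
```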
Thanks @James Moriarty. I can try updating the default separator to account for both newlines and spaces, so that stuff like this will be less likely to happen in the future.
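
For illustration only (a sketch of the idea, not the actual library patch), splitting on runs of any whitespace instead of the literal space character handles both cases:

```python
import re

text = 'cell one\ncell two  cell three'

# A space-only separator leaves newline-joined pieces fused together...
print(text.split(' '))         # ['cell', 'one\ncell', 'two', '', 'cell', 'three']

# ...while splitting on any whitespace run separates them cleanly.
print(re.split(r'\s+', text))  # ['cell', 'one', 'cell', 'two', 'cell', 'three']
```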