joey
Joined September 25, 2024

Doc

In general, why does it take longer to append a doc to an existing vector index than to rebuild the entire index?
1 comment
Hello folks, SimpleDirectoryReader load_data() throws the following error if I set num_workers=1, but not when num_workers is greater than 1:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.


This happens inside a ProcessPoolExecutor task kicked off from a FastAPI/Starlette background task.

Any ideas? I can just set num_workers=4 for now, but I'd like to understand why this happens.

versions:
llama-index-core 0.11.9
llama-index-readers-file 0.2.1
7 comments
Solved (thanks!): you have to instantiate the readers.

Correct: ".md": MarkdownReader(),

Incorrect: ".md": MarkdownReader,

MarkdownReader broken?

When I try to use my own set of file_extractors, I get the following error:
Failed to load file /app/data/manual.md with error: MarkdownReader.load_data() missing 1 required positional argument: 'file'. Skipping...

Code:
file_extractor = {
    ".csv": PandasCSVReader,
    ".docx": DocxReader,
    ...
}
SimpleDirectoryReader(
    input_dir=self.knowledge_path,
    file_extractor=file_extractor,
).load_data()

But this goes away if I just use the default extractors. Any ideas?
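For anyone hitting the same message: a "missing 1 required positional argument" error on load_data() is what plain Python raises when a method is called on the class rather than on an instance. Here is a minimal sketch of that mechanism. FakeMarkdownReader and load_with are hypothetical stand-ins invented for illustration, not the real llama-index classes or the loader's actual internals.

```python
from pathlib import Path


class FakeMarkdownReader:
    """Hypothetical stand-in for a file reader; not the real llama-index class."""

    def load_data(self, file):
        # A real reader would parse the file; this one just echoes its name.
        return [f"document from {file.name}"]


def load_with(extractor, file):
    """Call load_data on whatever was registered (assumed loader behaviour)."""
    return extractor.load_data(file)


path = Path("manual.md")

# Registering an instance works: `self` is bound automatically.
docs = load_with(FakeMarkdownReader(), path)

# Registering the class itself fails: `path` fills the `self` slot,
# so Python reports that `file` is missing.
try:
    load_with(FakeMarkdownReader, path)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(docs)
print(error_message)
```

With the class in the dict, the file path is consumed as `self`, which is why the message complains about `file` even though a file was passed.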
5 comments
Version conflict: llama-index-vector-stores-postgres constrains its pgvector dependency to <0.3.0, but pgvector is now on version 0.3.2. I'm pinning to 0.2.5 for now, but it would be good to raise the upper bound since this disrupts the onboarding flow.
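To make the conflict concrete, here is a rough sketch of the comparison a resolver ends up making against the <0.3.0 bound. It is a simplification that only handles plain X.Y.Z strings; real resolvers follow the full version-specifier rules, and the helper names are made up for illustration.

```python
def parse_version(text):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def satisfies_upper_bound(version, upper_bound):
    """True when version is strictly below the bound, i.e. meets '<upper_bound'."""
    return parse_version(version) < parse_version(upper_bound)


UPPER_BOUND = "0.3.0"  # from the '<0.3.0' constraint described in the post

latest_ok = satisfies_upper_bound("0.3.2", UPPER_BOUND)  # current pgvector release
pinned_ok = satisfies_upper_bound("0.2.5", UPPER_BOUND)  # the temporary pin
print(latest_ok, pinned_ok)
```

The latest release falls outside the bound while the 0.2.5 pin sits inside it, which is why a fresh install without the pin fails to resolve.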
1 comment
I recognize that any decent answer will come with plenty of qualifiers, but what do you all use as a "~80% good enough" starting point for building an index when indexing time is not a constraint? It's tempting to use fancy chunkers and add in all the extractors, and I know the real answer is to experiment and evaluate what's best for your dataset. But is there any "general" recommendation or go-to for what to try after the basic SimpleDirectoryReader + VectorStoreIndex?
5 comments