joey

Doc

In general, why does it take longer to append a doc to an existing vector index than to rebuild the entire index?

1 comment

joey

Simple directory reader throws error when num_workers set to 1

Hello folks, SimpleDirectoryReader load_data() throws the following error if I set num_workers=1, but not when num_workers is greater than 1:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

This happens inside a ProcessPoolExecutor task kicked off from a FastAPI/Starlette background task.

Any ideas? I can just set num_workers=4 for now, but I'd like to understand why this happens.

versions:

llama-index-core 0.11.9
llama-index-readers-file 0.2.1

7 comments

joey

MarkdownReader broken?

Solved (thanks!):
You have to instantiate the readers, not pass the classes themselves.

Correct: ".md": MarkdownReader(),

Incorrect: ".md": MarkdownReader,
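For anyone hitting the same message, here is a minimal sketch of why the uninstantiated class fails. It does not use llama-index at all: `StandInReader`, `load_with`, and `try_load` are hypothetical names made up for illustration, and the assumption is only that the directory reader ends up calling `load_data(path)` on whatever object was registered for the extension.

```python
from pathlib import Path


class StandInReader:
    """Hypothetical stand-in for a file reader; not the real MarkdownReader."""

    def load_data(self, file):
        # A real reader would parse the file; this one just echoes the name.
        return [f"contents of {Path(file).name}"]


def load_with(extractor, path):
    # Assumed behaviour: call load_data on whatever was registered.
    # On an instance, `path` binds to `file`.
    # On the bare class, `path` binds to `self`, so `file` is missing.
    return extractor.load_data(path)


def try_load(extractor, path):
    try:
        return load_with(extractor, path)
    except TypeError as exc:
        return f"Failed to load file {path} with error: {exc}. Skipping..."


ok = try_load(StandInReader(), "/app/data/manual.md")   # instance: works
broken = try_load(StandInReader, "/app/data/manual.md")  # class: TypeError
print(ok)
print(broken)
```

The second call reproduces the shape of the "missing 1 required positional argument: 'file'" error, which is why adding the parentheses fixes it.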

When I try to use my own set of file_extractors, I get the following error:

Failed to load file /app/data/manual.md with error: MarkdownReader.load_data() missing 1 required positional argument: 'file'. Skipping...

Code:

file_extractor = {
    ".csv": PandasCSVReader,
    ".docx": DocxReader,
    ...
}
SimpleDirectoryReader(
    input_dir=self.knowledge_path,
    file_extractor=file_extractor,
).load_data()

But this goes away if I just use the default extractors. Any ideas?

5 comments

joey

pgvector-python/CHANGELOG.md at master ·...

Version conflict: llama-index-vector-stores-postgres constrains its pgvector version to <0.3.0, but pgvector is now on version 0.3.2. I'm pinning to 0.2.5 for now, but it would be good to bump the constraint, since the conflict disrupts the onboarding flow.

1 comment

joey

I recognize that any decent answer will

I recognize that any decent answer will have plenty of qualifiers, but what do you all use as a "~80% good enough" starting point for building an index when time to index is not a constraint? It's tempting to use fancy chunkers and add in all the extractors, and I know the real answer is to experiment and evaluate what's best for your dataset. But is there any "general" recommendation or go-to opinion on what to try after the basic SimpleDirectoryReader + VectorStoreIndex?

5 comments
