
Updated 5 months ago

Hey there thanks for making this project

Hey there, thanks for making this project. Is there any way for the program to generate sources? For example, say I have a database of essays from Paul Graham. If I ask "what are the important things about starting a startup", how difficult would it be to let the program show the page numbers, or even just the name of the essays after each sentence?
10 comments
You get a response object from a query: https://gpt-index.readthedocs.io/en/latest/guides/usage_pattern.html#parsing-the-response. This response contains source_nodes, which hold the underlying text chunk and doc id for each source. Does that help solve your use case?
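A rough sketch of how the source_nodes could be turned into a "Sources" list under the answer. The stub classes and the attribute names `response`, `source_text`, and `doc_id` are assumptions standing in for the real query result, and the demo text is made up, so check the linked docs for the exact fields in your version.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubSourceNode:
    """Hypothetical stand-in for one entry of response.source_nodes."""
    source_text: str
    doc_id: Optional[str] = None


@dataclass
class StubResponse:
    """Hypothetical stand-in for the object returned by a query."""
    response: str
    source_nodes: List[StubSourceNode] = field(default_factory=list)


def format_sources(response, max_chars: int = 80) -> str:
    """Return the answer text followed by a numbered list of its sources."""
    lines = [str(response.response), "", "Sources:"]
    for i, node in enumerate(response.source_nodes, start=1):
        # Collapse whitespace and truncate so each source fits on one line.
        snippet = " ".join(node.source_text.split())[:max_chars]
        lines.append(f"[{i}] {node.doc_id or 'unknown'}: {snippet}")
    return "\n".join(lines)


# Made-up demo data; a real response would come from index.query(...).
demo = StubResponse(
    response="Example answer.",
    source_nodes=[
        StubSourceNode("First  example\nchunk of text.", "essay_a.txt"),
        StubSourceNode("Second example chunk.", None),
    ],
)
print(format_sources(demo))
```

If doc_id is set to the essay name, each listed source shows which essay the chunk came from.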
This is great! I'll look into it. Thanks
@jerryjliu0

Would you recommend bulk setting doc_id to the names of the files or utilizing metadata and extracting the information from there?

As an example, I have a large number of exported PDFs with names in the format TIMESTAMP.pdf, and I would like to be able to search for specific content and have the corresponding file name displayed.
Yeah, you can either pass file_metadata (of type Optional[Callable[[str], Dict]]) as an arg to SimpleDirectoryReader, or manually set doc_id on the documents after loading them with SimpleDirectoryReader. Another option: there's an extra_info field on the Document that you can set (it lets you specify a general dictionary of metadata per Document).
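A minimal sketch of the first option, assuming the callable receives the file path as a string and returns a plain dict. The function name `filename_metadata`, the dict keys, and the `exports` directory are hypothetical; the commented-out lines show roughly where it would plug in.

```python
import os
from typing import Dict


def filename_metadata(path: str) -> Dict[str, str]:
    """Build per-file metadata from the file path alone."""
    name = os.path.basename(path)
    # Files are assumed to be named TIMESTAMP.pdf, so the stem is the timestamp.
    stem, _ext = os.path.splitext(name)
    return {"file_name": name, "timestamp": stem}


# Assumed usage with the reader (not run here):
# documents = SimpleDirectoryReader("exports", file_metadata=filename_metadata).load_data()

print(filename_metadata("exports/1680000000.pdf"))
```

The returned dict travels with each Document, so the file name can be shown next to any chunk that ends up in the sources.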
Thanks, I've tried the first 2 but not the 3rd. Personally, what would you prefer to use?

Say we're trying to reduce query cost.
I suppose for something as simple as showing the filename it doesn't really matter
Sorry, just so I understand: what does this have to do with reducing query cost? I can help you with that too, but the suggestions above only set an id or metadata; the index would still need to process the underlying text content.
Sorry, I meant to ask which of those would affect the size of the index.json file the most?
I'm overthinking things. I think doc_id should work for now. And if I need more stuff later I'll use metadata. Thanks for the help!