Hey how do I set a custom doc id when

At a glance

The community member asked how to set a custom doc_id when loading documents from different data loaders like YouTube transcript and S3. Another community member suggested setting the doc_id after the documents have been loaded, for example, documents[0].doc_id = "my_doc_id".

The original poster then asked if the returned documents would have metadata with the filename, so they could set the doc_id based on the filename. They also asked if the documents would be returned in the same order as the input (e.g., YouTube URLs or S3 folder).

A third community member responded that it depends on the implementation of the data loader, and suggested reading the source code to understand how it works. They noted that for the YouTube loader, the doc_id would need to be set manually without metadata, and that it could be a good idea to contribute a pull request to ensure the metadata is set for the loaders being used.

mmrmvp

Hey, how do I set a custom doc_id when loading documents from different data loaders like youtube_transcript, s3, etc.?

4 comments

Logan M

you can set them after they've been loaded

documents = ....
documents[0].doc_id = "my_doc_id"
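
To set an id on every loaded document rather than only the first, the same assignment can go in a loop. This is a minimal sketch, not the library's own code: the `Document` class below is a hypothetical stand-in so the snippet runs without the library installed, and the `my_doc_{i}` naming scheme is only an illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """Hypothetical stand-in for the loader's Document class (illustration only)."""

    text: str
    doc_id: Optional[str] = None


# Pretend these came back from a data loader.
documents = [Document(text="first transcript"), Document(text="second transcript")]

# Assign a custom id to each document after loading.
for i, doc in enumerate(documents):
    doc.doc_id = f"my_doc_{i}"
```

With real loaders, `documents` would be whatever list the loader returns; the loop itself does not change.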

mmrmvp

Thanks. Would the returned documents have metadata with the filename, so I can set the doc id according to the filename? Let's say I load in data from YouTube URLs [url1, url2]: would the documents be returned in the same order, so I can set the first doc id as "url1"? Same with s3 when I load in documents from a folder.

Logan M

Depends on the implementation of the data loader from the community tbh lol

I would read the source code for the respective loader to see what's going on under the hood

With the YouTube loader, for example, you would have to set the doc id manually, without metadata:
https://github.com/emptycrown/llama-hub/blob/a109e482407586e98b731bf557700b4cc4fc706a/llama_hub/youtube_transcript/base.py#L29

Would be easy PRs to make sure the Metadata is set for the loaders you use 👌
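
One way to handle both cases could look like the sketch below: use a filename from metadata when the loader provides one, and otherwise fall back to the input URL at the same position. Everything here is an assumption to check against the loader's source: the stand-in `Document` class, the `metadata` attribute and its `file_name` key, the helper name `assign_doc_ids`, and above all the idea that the loader returns exactly one document per input, in input order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for the loader's Document class (illustration only)."""

    text: str
    doc_id: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)


def assign_doc_ids(
    documents: List[Document],
    sources: List[str],
    metadata_key: str = "file_name",  # assumed key; real loaders may differ
) -> None:
    """Set doc_id from metadata if present, else from the matching source."""
    # Positional fallback is only safe with one document per source.
    if len(documents) != len(sources):
        raise ValueError(
            f"got {len(documents)} documents for {len(sources)} sources; "
            "cannot pair them by position"
        )
    for doc, source in zip(documents, sources):
        doc.doc_id = doc.metadata.get(metadata_key) or source


urls = ["url1", "url2"]
docs = [
    Document(text="transcript 1"),  # no metadata: falls back to the URL
    Document(text="transcript 2", metadata={"file_name": "talk.txt"}),
]
assign_doc_ids(docs, urls)
```

The length check is there because a loader that skips a failed URL or splits one file into several documents would silently shift every id after that point.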

mmrmvp

Will do that. Thanks!
