A community member asked how to set a custom doc_id when loading documents through different data loaders, such as the YouTube transcript loader and the S3 loader. Another community member suggested setting the doc_id after the documents have been loaded, for example, documents[0].doc_id = "my_doc_id".
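The suggestion above can be sketched as a small helper. This is a minimal illustration, not the library's API: the `Document` class here is a hypothetical stand-in for whatever object the loader returns, and `assign_doc_ids` is an invented name. The only thing taken from the thread is that `doc_id` can be assigned on a loaded document.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a loader's document object."""
    text: str
    doc_id: str = ""


def assign_doc_ids(documents, doc_ids):
    """Overwrite each document's doc_id with the entry at the same position."""
    if len(documents) != len(doc_ids):
        # Refuse to guess when the loader returned more or fewer documents
        # than there are ids to hand out.
        raise ValueError("number of ids does not match number of documents")
    for doc, doc_id in zip(documents, doc_ids):
        doc.doc_id = doc_id
    return documents


# Pretend these came back from a data loader.
documents = [Document("first transcript"), Document("second transcript")]
assign_doc_ids(documents, ["my_doc_id", "my_other_doc_id"])
```

The length check matters because a loader may split one input into several documents, in which case a one-to-one assignment would silently mislabel them.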
The original poster then asked if the returned documents would have metadata with the filename, so they could set the doc_id based on the filename. They also asked if the documents would be returned in the same order as the input (e.g., YouTube URLs or S3 folder).
A third community member responded that both the metadata and the ordering depend on how each data loader is implemented, and suggested reading the loader's source code to see what it does. They noted that the YouTube loader does not set metadata, so the doc_id would have to be assigned manually, and that it could be worth contributing a pull request so that the loaders in use populate metadata.
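One way to handle both cases discussed above is to prefer a metadata field when the loader provides one and fall back to the input list otherwise. The sketch below is an assumption-heavy illustration: `LoadedDoc` is a hypothetical stand-in for a loader's document object, `set_ids_from_metadata` is an invented helper, and the `"file_name"` key is only a guess at what a file-based loader might set. Check the loader's source for the real key, and for whether it preserves input order, before relying on the fallback.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LoadedDoc:
    """Hypothetical stand-in for a document returned by a data loader."""
    text: str
    metadata: dict = field(default_factory=dict)
    doc_id: Optional[str] = None


def set_ids_from_metadata(docs, key="file_name", fallback_ids=None):
    """Set doc_id from metadata[key]; otherwise use fallback_ids by position.

    The positional fallback is only correct if the loader returns exactly one
    document per input, in input order. That is an assumption to verify.
    """
    for i, doc in enumerate(docs):
        value = doc.metadata.get(key)
        if value:
            doc.doc_id = str(value)
        elif fallback_ids is not None and len(fallback_ids) == len(docs):
            doc.doc_id = fallback_ids[i]
        else:
            raise ValueError(f"no id available for document at index {i}")
    return docs


# A loader that sets metadata (assumed key: "file_name").
s3_docs = [LoadedDoc("report body", metadata={"file_name": "report.pdf"})]
set_ids_from_metadata(s3_docs)

# A loader that sets no metadata: fall back to the input URLs.
urls = ["url1", "url2"]
yt_docs = [LoadedDoc("transcript one"), LoadedDoc("transcript two")]
set_ids_from_metadata(yt_docs, fallback_ids=urls)
```

Raising when neither source of ids is available keeps a document from ending up with a missing or mismatched doc_id without anyone noticing.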
Thanks. Would the returned documents have metadata with the filename, so I can set the doc_id according to the filename? Let's say I load data from YouTube URLs [url1, url2]: would the documents be returned in the same order, so I can set the first doc_id to "url1"? Same question for S3 when I load documents from a folder.