As far as I know, the page number is added automatically only for PDF files. You could use PDFReader to extract content from the PDF files, and UnstructuredReader for the rest.
Sample code would look like this:
from llama_index import download_loader
from llama_index import SimpleDirectoryReader
from llama_index.readers.file.docs_reader import PDFReader

# UnstructuredReader is fetched from LlamaHub at runtime
UnstructuredReader = download_loader('UnstructuredReader')

# Map each file extension to the reader that should handle it
dir_reader = SimpleDirectoryReader('./data', file_extractor={
    ".pdf": PDFReader(),
    ".html": UnstructuredReader(),
    ".eml": UnstructuredReader(),
})
documents = dir_reader.load_data()
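
To make the idea behind file_extractor a bit more concrete, here is a rough, library-free sketch of picking a reader by file extension. FakePdfReader, FakeTextReader and load_all are hypothetical stand-ins I made up for illustration, not llama_index APIs, and the real SimpleDirectoryReader behaves differently in the details (for example in how it handles extensions without a registered reader). It only shows why the PDF documents can carry a page number while the other formats do not.

```python
from pathlib import Path


class FakePdfReader:
    """Hypothetical stand-in for a PDF reader: one record per page."""

    def load_data(self, path):
        # Pretend every PDF has two pages; a real reader would parse the file.
        return [{"file": path.name, "page": n, "text": f"page {n}"} for n in (1, 2)]


class FakeTextReader:
    """Hypothetical stand-in for a generic reader: one record per file, no page."""

    def load_data(self, path):
        return [{"file": path.name, "text": "whole file"}]


def load_all(paths, file_extractor):
    """Pick a reader by lowercase file suffix and collect its records."""
    records = []
    for path in map(Path, paths):
        reader = file_extractor.get(path.suffix.lower())
        if reader is None:
            continue  # simplification: skip files with no registered reader
        records.extend(reader.load_data(path))
    return records


file_extractor = {
    ".pdf": FakePdfReader(),
    ".html": FakeTextReader(),
    ".eml": FakeTextReader(),
}
docs = load_all(["a.pdf", "b.html", "c.eml", "d.bin"], file_extractor)
```

In this sketch only the records coming from the PDF reader have a "page" key, which mirrors the behaviour described above.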