Thanks for the quick response @Logan M
I am new to this, but I think I am already doing what you said about splitting the web page into smaller sections.
The JSON file from the scrape is an array of objects, and each object has this shape:
[{
    "url": "",
    "metadata": {
        "canonicalUrl": "",
        "title": "",
        "description": "",
        "author": null,
        "keywords": null,
        "languageCode": ""
    },
    "text": ""
}, ...]
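For reference, this is roughly how I'm loading and sanity-checking it (a minimal sketch; the filename scrape.json is just a placeholder for my actual output file):

import json

# load the scraped array of objects; "scrape.json" is a placeholder name
with open("scrape.json", "r", encoding="utf-8") as f:
    items = json.load(f)

# quick check that each "text" really is only a few paragraphs
print(len(items), "items scraped")
print(max(len(item.get("text") or "") for item in items), "chars in the longest text")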
The "text" field from each object is about three paragraphs long, so it's not extremely long. Then I'm running each object through this transformer:
from llama_index import Document

def transform_dataset_item(item):
    """Map one scraped object to a LlamaIndex Document."""
    metadata = item.get("metadata", {}) or {}
    url = item.get("url") or ""  # default to an empty string if url is None
    title = metadata.get("title") or ""  # same for title
    description = metadata.get("description") or ""  # same for description
    # Ensure that all the values are strings
    url, title, description = str(url), str(title), str(description)
    return Document(
        item.get("text") or "",  # guard against a missing/None text field
        extra_info={
            "url": url,
            "title": title,
            "description": description,
        },
    )
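Then I just map the transformer over the whole array before indexing (minimal sketch; items is the list loaded from the JSON file above):

# turn every scraped object into a Document before building the index
documents = [transform_dataset_item(item) for item in items]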
So I think each document is already manageable. If this is all we can do, do you know if there are any ways to speed up the "Synthesis over Heterogeneous Data" approach? It's taking about 13 seconds, but it had, by far, the best answers.