I'm in the very early stages of learning

I'm in the very early stages of learning LlamaIndex. For the first time, I'm trying to use ChromaDB instead of just writing my index directly to disk.

But I get an error from ChromaDB about erroneous metadata: "ValueError: Expected metadata value to be a str, int, float or bool, got None which is a <class 'NoneType'>"

If I just write the same index to disk, it works. What is it that ChromaDB needs?

The documents I want to store embeddings for are just a couple of plain-text documents with some metadata in YAML format ("key: value").

17 comments

Logan M

Seems like something in your metadata is None -- but it should be str, int, float, or bool for chroma to work

thoresson

Any idea on what metadata ChromaDB is referring to? Even when I strip the YAML from my text files ChromaDB throws the same error. Running the same code on the Paul Graham essay used in many of the examples in the documentation works.

Logan M

I have no idea which metadata 😅 It depends on what data you loaded I suppose, how you loaded it, etc.

I would just cast my metadata 🤷‍♂️ Kind of silly that chroma doesn't do that for you.

for doc in documents:
  for key, val in doc.metadata.items():
    if val is None:
      doc.metadata[key] = "None"
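
To see which keys are affected before casting anything, a small check along these lines might help. It is only a sketch: `find_invalid_metadata` and the example dictionary are made up for illustration, and the allowed types are simply the ones named in the error message.

```python
# Types the error message lists as accepted metadata values.
ALLOWED_TYPES = (str, int, float, bool)


def find_invalid_metadata(metadata):
    """Return {key: type name} for every value that is not an allowed type."""
    return {
        key: type(val).__name__
        for key, val in metadata.items()
        if not isinstance(val, ALLOWED_TYPES)
    }


# Hypothetical metadata, shaped like what a file loader might attach.
example = {"file_name": "notes.md", "file_type": None, "file_size": 1024}
print(find_invalid_metadata(example))
```

Running the same function over `doc.metadata` for each loaded document should point at whichever key holds the None.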

thoresson

Strange thing is that I don't do any manual work with metadata at all. So I have no idea either. 🤣

And to make it even stranger, at least to me, I just worked around the error by changing the file suffix from .md to .txt. That was the only difference between the working test essay and my own data. And now it suddenly passed without any errors.

Logan M

that suffix change will make the file load with a different file-loader

Logan M

likely the md file loader is parsing your file slightly incorrectly? 🤔 It does some special handling for headers and whatnot

thoresson

In the load_data() method for SimpleDirectoryReader?

Logan M

SimpleDirectoryReader is a wrapper around several different readers for different file types.

Depending on the file suffix, a different reader is used for each file
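
As a rough illustration of that idea (not LlamaIndex's actual code), suffix-based dispatch can be sketched with a lookup table. Every name here, including `READERS`, `pick_reader` and the two reader functions, is hypothetical.

```python
from pathlib import Path


# Hypothetical readers, not LlamaIndex classes: each returns (text, metadata).
def read_plain_text(path):
    return Path(path).read_text(encoding="utf-8"), {}


def read_markdown(path):
    # A real markdown reader would do more, e.g. special handling for headers.
    return Path(path).read_text(encoding="utf-8"), {"format": "markdown"}


# Suffix -> reader table; anything not listed falls back to plain text.
READERS = {".md": read_markdown}


def pick_reader(path):
    """Choose a reader based only on the file suffix (case-insensitive)."""
    return READERS.get(Path(path).suffix.lower(), read_plain_text)
```

With a table like this, renaming `essay.md` to `essay.txt` is enough to send the same content through a different code path, which matches the behaviour seen above.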

thoresson

Thanks! Turns out that the MarkdownReader class does in fact return a metadata key with the value set to None.

Logan M

lol that's weird

Logan M

should probably update that

thoresson

No, not the MarkdownReader class.

thoresson

But probably the default_file_metadata_func

Logan M

I wonder why file_type would be none. In any case, it should probably be avoiding inserting None values
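
A minimal sketch of what "avoid inserting None values" could look like. The function name `file_metadata_without_none` and the exact set of keys are assumptions for illustration, not the library's implementation.

```python
import mimetypes
import os


def file_metadata_without_none(file_path):
    """Build per-file metadata and leave out any key whose value is None."""
    candidate = {
        "file_path": file_path,
        "file_name": os.path.basename(file_path),
        # guess_type returns (type, encoding); type is None for unknown suffixes.
        "file_type": mimetypes.guess_type(file_path)[0],
    }
    return {key: val for key, val in candidate.items() if val is not None}
```

If your LlamaIndex version lets you pass a custom metadata callable to SimpleDirectoryReader (worth checking against the docs for your version), a function like this could stand in for the default one.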

thoresson

Seems strange to me as well. This is how the file_type key is set in base.py:

`"file_type": mimetypes.guess_type(file_path)[0]`

Getting the file type from the file_path should be pretty straightforward?

Logan M

I guess it's meant to represent a mimetype rather than an actual file_type ?

thoresson

I don't fully get how mimetypes.py works; it seems like it might fetch a list of file types from a couple of different locations. But there is also a hardcoded list of file suffixes where .md is missing. Adding md to that list solved the problem.
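
For anyone who would rather not edit the list inside mimetypes.py, the standard library also allows registering a suffix at runtime with `add_type`. The sketch below uses a made-up suffix, `.zzdemo`, and a made-up type, `text/x-demo`, so that it behaves the same regardless of which suffixes a given Python version already knows.

```python
import mimetypes

# A fresh MimeTypes instance keeps this demo from touching the global table.
# ".zzdemo" is a made-up suffix standing in for one that is not registered.
types = mimetypes.MimeTypes()

before = types.guess_type("notes.zzdemo")[0]  # suffix unknown, so None
types.add_type("text/x-demo", ".zzdemo")       # register the suffix
after = types.guess_type("notes.zzdemo")[0]   # now resolves to the new type

print(before, after)
```

For the case in this thread, the equivalent would presumably be a single `mimetypes.add_type("text/markdown", ".md")` call before loading the documents. Newer Python versions may already map `.md`, in which case the call is harmless.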
