This is hard to do in general. The big providers offer "moderation" features, but how they work internally is mostly opaque. Some things they are likely doing:
- filtering specific terms (on input and/or output)
- filtering based on embeddings (on input and/or output)
- prompt-based techniques
- using an LLM to check (input and/or output)
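As a rough illustration of the simplest item in that list, here is a minimal sketch of term filtering applied to both the input and the output. Everything in it is an assumption for illustration: the `BLOCKED_TERMS` list, the `guarded_reply` wrapper, and the `generate` callable standing in for whatever model call you use. It is not how any provider actually implements moderation.

```python
import re

# Hypothetical blocklist; a real deployment would use a much larger, curated list.
BLOCKED_TERMS = {"password", "ssn"}

REFUSAL = "Sorry, I can't help with that."


def contains_blocked_term(text, blocked=BLOCKED_TERMS):
    """Return True if any blocked term appears as a whole word (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return not words.isdisjoint(blocked)


def guarded_reply(user_input, generate, refusal=REFUSAL):
    """Check the input, call the model, then check the output.

    `generate` is a placeholder for your own model call: it takes the
    user's text and returns the model's text.
    """
    if contains_blocked_term(user_input):
        return refusal
    output = generate(user_input)
    if contains_blocked_term(output):
        return refusal
    return output
```

Exact word matching like this is easy to evade (plurals, misspellings, paraphrases), which is one reason it tends to be combined with the embedding-based and LLM-based checks.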
The providers' systems are general-purpose and open ended. If you restrict yours to what it should know about (for example, when RAG doesn't return anything meaningful, reply that you don't know), the problem may get easier.
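The RAG fallback could look something like the sketch below. The names `answer_with_fallback`, `retrieve`, `generate`, and the `min_score` threshold of 0.75 are all hypothetical; `retrieve` is assumed to return `(text, score)` pairs where a higher score means a better match, which may not be how your retriever reports relevance.

```python
FALLBACK = "I don't know."


def answer_with_fallback(question, retrieve, generate, min_score=0.75):
    """Answer only when retrieval finds something relevant enough.

    `retrieve(question)` is assumed to return a list of (text, score) pairs.
    `generate(question, context)` is a placeholder for your model call.
    """
    # Keep only passages that clear the (assumed) relevance threshold.
    hits = [text for text, score in retrieve(question) if score >= min_score]
    if not hits:
        # Nothing meaningful retrieved: refuse instead of letting the model improvise.
        return FALLBACK
    context = "\n".join(hits)
    return generate(question, context)
```

The threshold would need tuning against your own data; set too low, off-topic questions slip through, and set too high, the system refuses questions it could have answered.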
Prompt injection is another area to look into: it shows how cleverly these protection systems can be broken, to the point that fully solving the problem is likely impossible.