Data Privacy in Automations
The AI Kit makes it easy to send any data through an AI model. That convenience cuts both ways — sensitive data can end up in places you did not intend. This page describes the practices we recommend for protecting data inside your automations.
This is not legal advice. Consult your data-protection officer or counsel for specific regulatory questions (GDPR, HIPAA, sector-specific rules).
A simple framing: where does my data go?
Every step in an automation either:
- Stays inside the AI Kit (memory, file writes to local disk, knowledge queries).
- Reaches a system you control (your database, your mail server, your internal HTTP API).
- Reaches a third party (OpenAI, Anthropic, Mistral, Google, MinIO if hosted elsewhere).
The first two are usually fine — you decide what to keep, you decide who has access. The third is where most privacy concerns live.
When designing an automation, classify each step:
| Step | Where does data go? | Sensitive content allowed? |
|---|---|---|
| LLM Prompt with OpenAI / Anthropic / Mistral | Third party | Only after deliberate review |
| LLM Prompt with self-hosted Ollama | Stays local | Yes |
| Database Query against your DB | Your system | Yes |
| HTTP Request to a public API | Third party | Only if the API's data policy permits it |
| File Writer | Stays local | Yes |
| Send E-Mail | Depends on the SMTP target | Treat as third party if the SMTP host is external |
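The classification in the table can also be kept as data next to the automation, so that third-party steps are easy to list during review. The sketch below is illustrative only: `Destination`, `Step`, and `needs_review` are hypothetical names, not part of the AI Kit.

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    """Where a step's data ends up (mirrors the table above)."""
    LOCAL = "stays local"
    OWN_SYSTEM = "your system"
    THIRD_PARTY = "third party"


@dataclass(frozen=True)
class Step:
    name: str
    destination: Destination


def needs_review(steps: list[Step]) -> list[str]:
    """Return the names of steps whose data leaves systems you control."""
    return [s.name for s in steps if s.destination is Destination.THIRD_PARTY]


# Hypothetical automation, classified step by step.
steps = [
    Step("LLM Prompt (OpenAI)", Destination.THIRD_PARTY),
    Step("LLM Prompt (Ollama)", Destination.LOCAL),
    Step("Database Query", Destination.OWN_SYSTEM),
    Step("HTTP Request (public API)", Destination.THIRD_PARTY),
]

print(needs_review(steps))
```

Any step the function returns is one where sensitive content should pass only after a deliberate review.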
Decide what is sensitive
The category of "sensitive" varies by organization and by regulation. A practical baseline:
- Personal data — names, e-mail addresses, phone numbers, addresses, employee IDs, customer IDs.
- Health, financial, legal data — almost always sensitive.
- Internal business data — financial forecasts, strategy documents, unannounced product details.
- Credentials — passwords, API keys, tokens. Sensitive at all times, by everyone's standard.
For each automation, write a one-liner: "This automation processes [category]. Sensitive: yes / no." If yes, the next sections apply.
Patterns that work
Pattern A: keep everything local
Use self-hosted Ollama for all AI steps. No data leaves your network.
- Pros: simplest privacy story.
- Cons: open-source models still trail proprietary ones on the hardest tasks. Costs are infrastructure rather than per-token.
Pattern B: anonymize before external AI
Strip identifying information from the data before it reaches an external provider, then restore it on the way back. The platform's Anonymizer and Deanonymizer integrations are designed for this — see also Pseudonymization.
- Pros: lets you use the best models on the market while keeping personal data internal.
- Cons: anonymization is statistical, not perfect, so some identifiers can slip through. For very high-risk data, do not rely on it alone.
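The round trip in Pattern B can be sketched as follows. This is not the platform's Anonymizer or Deanonymizer integration: the `anonymize` and `deanonymize` functions, the `<EMAIL_n>` placeholder format, and the e-mail regex are assumptions made for illustration, and a regex covers only one kind of identifier.

```python
import re

# Simplified e-mail pattern; an assumption for this sketch, not a full validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct e-mail address with a stable placeholder."""
    mapping: dict[str, str] = {}  # placeholder -> original value
    seen: dict[str, str] = {}     # original value -> placeholder

    def substitute(match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            token = f"<EMAIL_{len(seen) + 1}>"
            seen[value] = token
            mapping[token] = value
        return seen[value]

    return EMAIL_RE.sub(substitute, text), mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into text returned by the external model."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


original = "Contact anna@example.com or bob@example.org; anna@example.com prefers e-mail."
safe, mapping = anonymize(original)
print(safe)  # only this string would be sent to the external provider
print(deanonymize(safe, mapping))
```

The mapping never leaves your side. Only the placeholder text goes to the external provider, and the same placeholder is reused for repeated values so the model can still tell the two people apart.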
Pattern C: external AI on already-public data only
Some automations operate on data that is already public — press releases, public web pages, generic templates. External AI is fine here without ceremony.
- Pros: simple, fast.
- Cons: requires you to actually verify the data is public.
Pattern D: hybrid
Different steps use different providers. Triage with a small external model on already-anonymized data, then escalate to a self-hosted model for the sensitive parts.
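A minimal sketch of the routing decision behind the hybrid pattern, assuming each step's input has already been classified. `StepInput` and `choose_provider` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepInput:
    contains_sensitive: bool
    anonymized: bool = False


def choose_provider(step: StepInput) -> str:
    """Send raw sensitive input to the self-hosted model, everything else externally."""
    if step.contains_sensitive and not step.anonymized:
        return "self-hosted"
    return "external"


print(choose_provider(StepInput(contains_sensitive=True, anonymized=True)))
print(choose_provider(StepInput(contains_sensitive=True)))
```

Defaulting `anonymized` to `False` means a step that forgets to declare its status is routed to the self-hosted model, which is the safer failure mode.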
Anti-patterns
- ❌ Sending whole user records to an external LLM without anonymization. The data may now sit in the provider's logs.
- ❌ Using debug LLM Prompt steps that dump full memory to "see what is going on". That memory may contain anything the prior steps produced.
- ❌ Storing customer data in knowledge bases to "make the agent aware of it" without an access-control story. The knowledge is searchable by any agent connected to it.
- ❌ Logging full request bodies through an HTTP Request step's memory output. Logs land in job details, which sit in the data volume.
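If you do need to inspect memory or a request body while debugging, one option is to mask sensitive entries before anything is written out. The sketch below assumes the data is a nested structure of dicts and lists. The `redact` helper and the `SENSITIVE_KEYS` list are hypothetical and would need to match your own field names.

```python
from typing import Any

# Hypothetical key names; adapt to the fields your automations actually carry.
SENSITIVE_KEYS = frozenset({"password", "api_key", "token", "email", "authorization"})


def redact(value: Any, sensitive_keys: frozenset[str] = SENSITIVE_KEYS) -> Any:
    """Return a copy of value with sensitive dict entries masked."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in sensitive_keys else redact(v, sensitive_keys)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    return value


memory = {
    "customer": {"email": "anna@example.com", "plan": "pro"},
    "request": {"headers": {"Authorization": "Bearer abc123"}, "body": [{"token": "t-1"}]},
}
debug_view = redact(memory)
print(debug_view)
```

Matching on key names catches only fields you have listed. Free-text values can still contain personal data, so this reduces exposure rather than removing it.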
Cross-cutting practices
- Minimize. Only send what the step actually needs. If a single field will do, do not send the whole record.
- Annotate. Use the automation's description field to note its privacy properties. A future maintainer will thank you.
- Audit. Periodically open a few jobs in the Jobs view and check what is actually being sent. Drift happens.
- Retain less. The platform prunes finished workflow jobs on a schedule, but agent jobs are kept forever. For sensitive agent data, consider deleting old conversations explicitly.
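The "Minimize" practice can be enforced with an explicit allowlist, so that newly added fields are not forwarded by accident. A minimal sketch, in which the `minimize` helper and the record fields are made up for illustration:

```python
def minimize(record: dict, allowed_fields: set[str]) -> dict:
    """Keep only the fields a step actually needs."""
    return {key: value for key, value in record.items() if key in allowed_fields}


# Hypothetical support-ticket record.
record = {
    "customer_id": "C-1042",
    "email": "anna@example.com",
    "ticket_text": "The export button does nothing.",
}

payload = minimize(record, {"ticket_text"})
print(payload)
```

An allowlist fails closed: a field added to the record later stays out of the payload until someone deliberately adds it.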
Recommendations
- ✅ Write a one-line privacy classification on every automation that touches anything beyond fully-public data.
- ✅ Default to self-hosted Ollama for sensitive workflows. Switch to external providers only when the quality gap matters.
- ✅ Combine anonymizer + deanonymizer in any automation that sends user-identifying text to external AI.
- ✅ Limit who can edit automations. The editor is privileged — see Administration → Manage Users.
- ⚠️ Cloud AI providers may retain prompts for a period that depends on your contract with them. Read the contract; do not rely on assumption.
- ❌ Do not treat self-hosted as "automatically safe". A self-hosted model running on a server reachable by everyone in the company is no more private than an external one.
What to do next
- Pseudonymization — the practical how-to for the anonymize/deanonymize pipeline.
- Logging and Compliance — what the platform writes about runs.
- Models → Ollama (Self-hosted) — the local-only path.