By Ryan Howell · 8 min read

The Legal Risks AI Startups Don't See Coming

Building with AI creates legal exposure that traditional SaaS never had — from training data copyright to model output liability to open source licenses that aren't actually open source. Here's what founders need to lock down before it becomes a due diligence problem.

Tags: IP, compliance, founders

Every startup ships AI features now. Some are building foundation models. Most are wrapping APIs, fine-tuning open-weight models, or using AI to automate something that used to take a human.

The product risk is well understood: hallucinations, accuracy, UX. The legal risk is not. And it's the legal risk that shows up in your Series A diligence, your first enterprise security review, or — worst case — a cease-and-desist you didn't see coming.

Here's what actually matters.


Training Data and the Copyright Question

If you're fine-tuning a model on data you collected, scraped, or licensed, the first question is whether you had the right to use it.

The major AI copyright cases — The New York Times v. OpenAI, Getty Images v. Stability AI, Concord Music v. Anthropic — are all still working through the courts. Fair use as a defense for training on copyrighted material at scale is an open legal question. Courts are drawing early lines: fair use may protect training on lawfully obtained data, but companies face significant risk where datasets include pirated, scraped, or improperly sourced material.

What this means for startups:

If you're using a commercial API (OpenAI, Anthropic, Google), the training data liability is largely theirs to manage. Your risk is lower but not zero — if your product generates output that's substantially similar to copyrighted material, you could still face claims.

If you're training or fine-tuning your own models, you need to know exactly where your data came from. "We scraped it from the internet" is not a defensible position. Document your data provenance. If you licensed datasets, keep the agreements. If you're using customer data to fine-tune, make sure your terms of service explicitly permit it — and give customers the ability to opt out.

Series A investors will ask about this. Have an answer.
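
One way to keep that answer ready is a small provenance log that flags datasets you could not defend. The following is a minimal sketch, not a legal standard: the `DatasetRecord` fields, the `basis` labels, and the `provenance_gaps` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    """One entry in a hypothetical training-data provenance log."""
    name: str
    source: str                          # vendor, URL, or internal system
    basis: str                           # "licensed", "customer_consent", or "scraped"
    agreement_ref: Optional[str] = None  # pointer to the license or ToS version on file
    opt_out_honored: bool = False        # only meaningful for customer data


def provenance_gaps(record: DatasetRecord) -> list[str]:
    """Return the reasons a record would be hard to defend in diligence."""
    gaps = []
    if record.basis == "scraped":
        gaps.append("scraped data has no documented right to use")
    if record.basis in ("licensed", "customer_consent") and not record.agreement_ref:
        gaps.append("no agreement or terms-of-service version on file")
    if record.basis == "customer_consent" and not record.opt_out_honored:
        gaps.append("customer opt-out not implemented")
    return gaps


# Illustrative records only; the names and sources are made up.
records = [
    DatasetRecord("support_tickets", "internal CRM", "customer_consent", "ToS v3.2", True),
    DatasetRecord("forum_dump", "web crawl", "scraped"),
]
for record in records:
    print(record.name, provenance_gaps(record))
```

An empty list means the record has a documented basis; anything else is a question to resolve before an investor asks it.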


Model Output Liability

When your AI product gives a wrong answer, who's responsible?

The legal framework depends on what the output is and what the user does with it:

Information products. If your AI summarizes legal documents, generates financial analysis, or provides health-related recommendations, you're potentially liable for the content of those outputs — even with disclaimers. Courts haven't fully tested this, but the direction is clear: if your product looks like professional advice, you'll be held closer to that standard.

Decision automation. If your AI makes decisions that affect people (credit scoring, hiring screening, content moderation), you're in regulated territory. Bias in model outputs creates discrimination liability under existing law. The EU AI Act classifies uses such as credit scoring and hiring as high-risk and imposes conformity assessment requirements on them.

Generated content. If your product generates text, images, or code that infringes someone's copyright or defames a real person, there's an open question about whether you or the user is liable. The safest assumption: if your system produced it, you'll be named in the suit.

What to do about it:

Your terms of service need to be specific about what your AI does and doesn't promise. Broad disclaimers help, but they don't eliminate liability — they reduce it. If your product operates in a regulated domain, talk to counsel before launch, not after.

Your enterprise contracts should clearly allocate responsibility for AI outputs between you and the customer. This is especially true for B2B products where the customer is relying on your AI to make or support business decisions.


Open Source Licensing in AI Stacks

This is where founders consistently underestimate the risk.

Modern AI products are built on layers of open source: model weights, inference frameworks, training libraries, agent toolkits, vector databases. Each layer has a license. Those licenses have different obligations, and some of them are incompatible with commercial products.

The permissive tier — MIT, Apache 2.0, BSD. Use freely in commercial products with minimal obligations (attribution, license inclusion). Garry Tan's gbrain (MIT), LangChain (MIT), and most of the popular inference frameworks sit here. If your entire AI stack is permissive-licensed, you're in good shape.

The copyleft tier: GPL, LGPL, AGPL. These require that derivative works be distributed under the same license (LGPL is narrower: it generally lets you link to the library without relicensing your own code). For AI, the question is what constitutes a "derivative work": if you modify GPL-licensed training code and distribute the result, you may be required to open-source your modifications. AGPL extends this to software accessed over a network, which catches SaaS products that GPL doesn't. If you're running AGPL-licensed components in your inference pipeline, understand what that means before you ship.

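The "open" tier that isn't: Meta's Llama models are marketed as open source but ship under a custom license with real restrictions. If your products had more than 700 million monthly active users when the model version was released, the standard license doesn't cover you and you must request a separate one from Meta, a threshold that in practice excludes Meta's largest competitors. Earlier versions (Llama 2 and 3) prohibited using Llama outputs to improve other large language models; later versions allow it, with naming and attribution conditions. Mistral releases some models under Apache 2.0 and others under more restrictive research or commercial licenses. These aren't MIT. Read the license for the specific model version before you build on it.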

What to do about it:

Run a license audit of your full AI stack — model weights, libraries, frameworks, datasets. Know what licenses you're subject to before you're in a diligence process where someone else finds out for you. Your IP assignment agreements should cover AI-assisted work product, and your open source policy should specify which license categories are approved for use in production.
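
A first pass at the library layer of that audit can be automated. The sketch below assumes a Python environment and uses the standard library's `importlib.metadata` to read each installed package's declared license. The tier names and the substring matching in `classify_license` are illustrative heuristics, not a legal classification, and the script does not see model weights or datasets, which still need a manual review.

```python
from importlib.metadata import distributions

# Rough markers for the tiers described above. Substring matching is a
# heuristic and will misclassify some packages; treat the output as a
# starting list for review, not a conclusion.
COPYLEFT_MARKERS = ("GPL", "GENERAL PUBLIC LICENSE")  # also matches LGPL and AGPL
PERMISSIVE_MARKERS = ("MIT", "APACHE", "BSD")


def classify_license(license_text: str) -> str:
    """Bucket a declared license string as copyleft, permissive, or review."""
    text = (license_text or "").upper()
    # Check copyleft first so dual-licensed packages are flagged conservatively.
    if any(marker in text for marker in COPYLEFT_MARKERS):
        return "copyleft"
    if any(marker in text for marker in PERMISSIVE_MARKERS):
        return "permissive"
    return "review"  # unknown, custom, or missing license metadata


def declared_license(meta) -> str:
    """Combine the license fields a package may declare in its metadata."""
    classifiers = [
        c for c in (meta.get_all("Classifier") or []) if c.startswith("License ::")
    ]
    fields = [meta.get("License-Expression") or "", meta.get("License") or ""]
    return " ".join(fields + classifiers)


def audit_installed_packages() -> dict[str, list[str]]:
    """Group every installed distribution by its heuristic license tier."""
    report = {"permissive": [], "copyleft": [], "review": []}
    for dist in distributions():
        meta = dist.metadata
        if meta is None:  # broken install with no metadata file
            continue
        tier = classify_license(declared_license(meta))
        report[tier].append(meta.get("Name") or "unknown")
    return report


if __name__ == "__main__":
    for tier, names in audit_installed_packages().items():
        print(f"{tier}: {len(names)} packages")
```

Anything that lands in "copyleft" or "review" is what to hand to counsel, along with the licenses for your model weights and datasets.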


Customer Indemnification

Enterprise customers will ask you to indemnify them against IP infringement claims arising from your AI product's outputs. This is becoming standard in B2B AI contracts.

The question is how far you're willing to go.

What's reasonable: Indemnifying the customer against third-party IP claims arising from your platform's core technology — the model, the training data, the infrastructure. This is analogous to standard software IP indemnification.

What's not: Indemnifying against claims arising from how the customer uses your product, what prompts they provide, or what they do with the outputs. If a customer feeds your AI proprietary data from a competitor and generates something infringing, that shouldn't be your liability.

The practical approach: Define a clear boundary in your contract. You indemnify for platform IP. The customer indemnifies for their inputs and their use of outputs. Carve out specific scenarios that are outside your control. Enterprise buyers will push back — negotiate the scope, not the principle.

OpenAI, Microsoft, Google, and Anthropic all offer some form of IP indemnification for their API customers now. If you're building on their APIs, understand what their indemnification covers and where your product creates incremental risk that sits outside their coverage.


Upstream Provider Terms Flow Downstream

If your product is built on OpenAI, Anthropic, or Google's APIs, you are bound by those providers' acceptable use policies, and so, in effect, is your customers' use of your product, whether they know it or not.

Most API providers prohibit specific use cases: generating CSAM, facilitating violence, mass surveillance, certain regulated applications. If your product enables a use case that violates your upstream provider's terms, they can cut your API access without notice. That's an existential risk if you have no fallback.

What to do about it:

Your own terms of service should reflect — or be more restrictive than — your upstream providers' acceptable use policies. If OpenAI prohibits a use case and your product enables it, the risk is yours. Map your upstream restrictions to your downstream terms and make sure they're consistent.

If you're building on multiple providers (common for redundancy), understand that their policies differ. What's permitted on one may not be on another.
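
The mapping exercise reduces to a set comparison once each policy has been summarized into categories. The sketch below assumes that summary already exists; the provider names and category labels are hypothetical placeholders, and turning a real acceptable use policy into categories is a manual step that needs legal review.

```python
# Hypothetical, simplified category labels for each upstream provider.
UPSTREAM_PROHIBITED = {
    "provider_a": {"csam", "violence", "mass_surveillance"},
    "provider_b": {"csam", "violence", "regulated_applications"},
}

# What your own terms of service currently prohibit (also illustrative).
OUR_TERMS_PROHIBITED = {"csam", "violence", "mass_surveillance"}


def uncovered_restrictions(upstream, downstream):
    """Map each provider to the restrictions our own terms fail to include."""
    return {
        provider: sorted(prohibited - downstream)
        for provider, prohibited in upstream.items()
        if prohibited - downstream
    }


gaps = uncovered_restrictions(UPSTREAM_PROHIBITED, OUR_TERMS_PROHIBITED)
print(gaps)  # {'provider_b': ['regulated_applications']}
```

An empty result means your terms are at least as strict as every provider you route to. A non-empty one lists what to add before a customer's use case puts your API access at risk.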


What to Lock Down Before Your Next Raise

Investors are getting sharper on AI-specific diligence. Here's the checklist:

  • Training data provenance — documented source, license, or consent for every dataset used in fine-tuning
  • Open source license audit — full dependency tree, including model weights, with license classification
  • Terms of service — specific language covering AI output disclaimers, use restrictions, and data handling
  • Customer contracts — clear IP indemnification scope with defined boundaries for platform vs. customer responsibility
  • IP assignment agreements — updated to cover AI-assisted and AI-generated work product
  • Upstream provider compliance — your terms are consistent with (or stricter than) your API providers' acceptable use policies
  • Privacy and data handling — customer data used for model improvement requires explicit consent with opt-out capability

None of this is theoretical. It's what Series A counsel will ask about, and having clean answers shortens diligence and strengthens your negotiating position.


The legal landscape for AI is moving fast, and courts haven't yet resolved most of the hard questions. That's exactly when getting your contracts and policies right matters most, because you're setting the terms before precedent does it for you.
