We started Blossom because we found ourselves caught between two incomplete answers: research that never met production, and products that were dressed-up wrappers around models nobody had really studied. We thought there was room for one company that did both well.
Modern AI is good enough to deploy and not good enough to deploy carelessly. That gap — between what models can do and what businesses can trust them to do — is where most operationally serious work breaks down today.
Our bet is that closing that gap is a research problem, not just a tooling one. So we built a lab and a product company together, and we run them as a single team. The lab studies how AI should reason, route, and collaborate with experts. The product takes those findings into businesses where the cost of being wrong is real.
The questions worth answering only show up under load — and the products worth shipping demand the rigor of a lab.
A small founding team, one thesis: research and product belong in the same building. The lab is incubated in Tokyo as the research arm; the first weeks are spent writing down what we believe.
A first routing experiment ships on real engineering traffic — choosing between models on cost, latency, quality, and privacy. The exercise becomes the foundation for Furiwake, the routing-and-evaluation layer for agent labor.
A partner runs the platform on live traffic. We learn more in three months than in the prior year.
Rensei (annotation + benchmark governance) joins the platform, and RL Sim — a simulated environment where agents rehearse production tasks before they touch real traffic — enters beta. Each shares the same evaluation and observability substrate as Furiwake.
Across two offices in Tokyo and San Francisco, the lab and product organisation share a single weekly review. The next year is about depth, not breadth.
A research question earns its place by naming where it bites in production. We work backwards from the operator’s problem.
Agents make a thousand mistakes in our environments before they make their first real one. The fidelity of the sim is the bar we hold ourselves to.
Domain expertise should travel into our models faster than it travels out. Specialists correct, audit, and steer; the model knows when to ask.
Wrong with confidence costs more than right with humility. Refusal, deferral, and uncertainty are first-class outcomes.
Papers, notes, benchmarks. The product benefits, but the field has to be able to check our work.