Hybrid
San Francisco, CA
$180k–$220k + Equity
December 5, 2025
Our client is a fast-growing AI start-up building foundational AI safety and reasoning systems designed to keep advanced AI models aligned, reliable, and under human oversight. As AI models become more powerful, the metrics, evaluations, and data intelligence behind them become mission-critical. The team is developing the measurement systems that ensure models behave safely, consistently, and according to human-aligned objectives.
If you want to shape the analytics, benchmarking, and insight engine behind cutting-edge AI safety work, this role gives you the chance to define how next-generation AI systems are evaluated and understood.
What You’ll Do
- Build and own the company’s internal and external metrics frameworks, defining how model quality, safety, reliability, and performance are measured
- Develop BI dashboards, reporting layers, and analytics tools to track model behavior, product adoption, and system health
- Run post-hoc data analysis to guide product decisions, uncover failure modes, and validate new features
- Benchmark models and systems against leading LLMs, designing comparative evaluations and performance studies
- Build, design, and analyze A/B tests to measure impact and drive data-backed product improvements
- Collaborate with ML, product, and research teams to translate ambiguous questions into structured analyses and actionable insights
- Work with the engineering team to ensure high-quality data flows, instrumentation, and measurement pipelines
What We’re Looking For
- 3–6+ years of experience in data science, applied analytics, or research-driven data work
- Strong proficiency with Python, SQL, and statistical analysis libraries
- Experience building metrics systems, dashboards, instrumentation, or analytical reporting tools
- Ability to design and run A/B tests, experiments, and causal inference analyses
- Familiarity with evaluating machine learning models—especially LLMs—through benchmarks or custom evaluation suites
- Comfortable with ambiguous problem spaces and translating qualitative goals into quantitative frameworks
Bonus Points
- Experience in any of the following:
  - AI Safety, AI Security, or Trust & Safety evaluations
  - LLM evaluation frameworks, rubric-based scoring, or red-teaming data
  - Building analytics or evaluation systems for ML-driven products
  - Working with experiment platforms, BI tools, or observability stacks
- Prior experience partnering with ML research teams on model evaluation, bias testing, or safety metrics
