The Data Bottleneck: Why Andreessen Horowitz Bet $30M on Protege
Welcome to Healthcare Markets & Technology.
Rigorous analysis of AI, policy, capital, technology, and clinical operations across U.S. healthcare — written for the people who build, invest in, and lead it.
— Trey
Table of Contents
The Exhaustion Problem
Why Travis May Built This Again
The a16z Investment Thesis
What Protege Actually Does
Why This Team Can Execute
Economic Realignment
What This Means for Builders
The Exhaustion Problem
The progression of language models from GPT-2 to GPT-4 and beyond tells a clear story about the role of data in AI advancement. Early gains came from better architectures and more compute. Transformers beat LSTMs. Scaling laws held. More parameters plus more GPUs equaled better performance. But somewhere around 2023, the easy wins from architecture and compute started running into a hard wall. Not because the models could not scale further, but because the training data could not.
Common Crawl has been scraped to death. Reddit threads from 2012 have been ingested a dozen times over. GitHub repositories are exhausted. Wikipedia exists in every major model’s training corpus. The entire public internet, which seemed infinite when these projects started, turns out to be finite and largely consumed. Synthetic data generation helps at the margins but cannot replace real-world complexity. Models trained primarily on synthetic data tend to collapse into repetitive patterns or hallucinate in predictable ways when faced with novel situations.
The problem extends beyond text. Computer vision models need diverse, high-quality labeled images and videos that capture edge cases, rare events, and unusual lighting conditions. Audio models need clean recordings across accents, environments, and acoustic conditions. Robotics and embodied AI need sensor data from physical environments. Medical AI needs patient outcomes across diverse populations and treatment contexts. All of this data exists, but almost none of it is publicly available or easily accessible.
Meanwhile, model architectures are converging. The difference between leading frontier models has less to do with fundamental architectural innovations and more to do with training data quality, instruction tuning datasets, and RLHF approaches. When Anthropic releases a new Claude variant or Google ships an updated Gemini model, the competitive advantage often comes down to what data they trained on and how they curated it, not whether they invented a novel attention mechanism.
This creates an uncomfortable reality for AI builders. The next 10x improvement in model capability will not come primarily from buying more H100s or hiring more ML researchers. It will come from getting access to better training data. Specifically, real-world data that captures the messy, multimodal, high-stakes environments where AI systems will actually operate. The internet represents maybe 5% of the world’s total data. The other 95% sits in hospitals, enterprises, research labs, media archives, and operational systems. Unlocking that data is the problem Protege is solving.
Why Travis May Built This Again
Travis May has spent nearly two decades building data infrastructure companies, and Protege represents his third major swing at solving data fragmentation problems. His track record is about as good as it gets in enterprise data, with two successful companies already under his belt before starting Protege at age 37.
May co-founded LiveRamp in 2011 with Auren Hoffman, initially joining as VP of Product before becoming CEO. The company, originally called Rapleaf, built identity resolution infrastructure for marketing and advertising, becoming the dominant platform for how brands connected customer data across different systems while maintaining privacy. LiveRamp was acquired by Acxiom for $310M in 2014, later spun out as an independent public company in 2018, and at its peak was processing data connections for basically every major brand and publisher.
The a16z Investment Thesis
Andreessen Horowitz leading a $30M Series A extension in Protege in January 2026 signals strong conviction that data infrastructure will be foundational to AI advancement. The financing expanded the company’s initial $25M Series A from August 2025, bringing total funding to $65M since founding in 2024. The two Series A tranches account for $55M of that total, which implies roughly $10M raised before the Series A. Returning investors include Footwork, CRV, Bloomberg Beta, Flex Capital, and Shaper Capital.
The thesis breaks down into several components. First, data access is genuinely the limiting factor for AI advancement right now. a16z’s portfolio companies across AI and machine learning are all running into the same problem. They need diverse, high-quality training data and cannot get it efficiently. Startups are burning millions on business development to cobble together datasets. Even well-funded companies struggle to access the data they need at the speed AI development requires. This creates demand for infrastructure that solves the problem systematically.
Second, the market is massive and growing. AI is eating every industry, and every AI application needs training data specific to its domain. Healthcare AI needs patient data. Autonomous vehicles need driving data. Robotics needs sensor data from physical environments. Media companies need content libraries. The total market for training data could be larger than cloud computing because it cuts across every AI use case.
Third, network effects create defensibility. Once Protege has relationships with hundreds of data suppliers and dozens of major AI companies, new entrants face enormous barriers. Data suppliers will not want to manage relationships with multiple platforms. AI builders will not want to integrate with multiple data sources when one platform gives them everything. The winner in this market could be winner-take-most, similar to how Snowflake dominated cloud data warehousing or how Databricks dominated data lakehouse architecture.
Fourth, timing is critical and favorable right now. AI companies are desperate for training data as public datasets run out. Frontier labs are willing to pay substantial amounts for unique datasets. Data suppliers are waking up to the value of their data assets and looking for ways to monetize them. Regulatory frameworks around AI training data are still forming, creating an opportunity to help shape norms and standards. The window to build dominant data infrastructure is open but will not stay open forever.
The investment came from a16z’s Bio and Health team, with partners Daisy Wolf and Eva Steinman involved. This makes sense given Protege’s initial focus on healthcare data, though the platform has expanded into video, audio, and motion capture. The Bio and Health team’s involvement suggests a16z sees healthcare as the beachhead market but understands the platform will expand across verticals.
The $30M round size on top of a previous $25M suggests a16z expects Protege to scale quickly. This is not a seed investment in an unproven team testing product-market fit. It is a bet that the team can rapidly build supply and demand network effects before competitors emerge. The capital likely goes toward hiring engineers to build technical infrastructure, business development to sign data suppliers, sales to land AI company customers, and compliance infrastructure to operate across jurisdictions.
What Protege Actually Does
Protege operates as a two-sided marketplace connecting data suppliers with AI builders, but calling it a marketplace undersells the technical and operational complexity involved. On the supply side, Protege partners with hospitals, health systems, labs, imaging centers, research networks, media companies, and other data holders. According to the company’s announcements, Protege expanded its data partner network to hundreds of organizations in 2025, providing aggregated access to new data sources and formats.
Each partnership involves negotiating data licensing terms, building technical integrations to extract and normalize data, implementing privacy and compliance controls, and establishing revenue sharing arrangements. Protege provides revenue share payouts to data partners with each use, creating an economic incentive for data holders to contribute to the platform.
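To make the per-use revenue share concrete, here is a minimal sketch of how a single license fee might be divided among contributing partners. Everything in it is an assumption for illustration: the `split_payout` function, the 60% supplier share, and the pro rata allocation by records used are hypothetical, and the company has not published how it actually calculates payouts.

```python
def split_payout(fee_cents: int, supplier_share_bps: int,
                 records_used: dict) -> dict:
    """Split one usage fee among suppliers, pro rata by records contributed.

    Amounts are integer cents and the share is in basis points, which
    avoids floating-point rounding drift. All parameters are hypothetical.
    """
    # Portion of the fee reserved for suppliers; the platform keeps the rest.
    pool = fee_cents * supplier_share_bps // 10_000
    total_records = sum(records_used.values())
    if total_records == 0:
        return {supplier: 0 for supplier in records_used}

    # Floor each supplier's pro rata share.
    payouts = {
        supplier: pool * count // total_records
        for supplier, count in records_used.items()
    }
    # Assign leftover cents to the largest contributor so payouts sum to the pool.
    remainder = pool - sum(payouts.values())
    largest = max(records_used, key=records_used.get)
    payouts[largest] += remainder
    return payouts


# Hypothetical usage event: a $10,000 fee with 60% going to suppliers.
example = split_payout(
    fee_cents=1_000_000,
    supplier_share_bps=6_000,
    records_used={"hospital_a": 600, "lab_b": 300, "imaging_c": 100},
)
print(example)
```

Integer cents and basis points keep the split exact, and the remainder rule guarantees the payouts always sum to the supplier pool.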
For healthcare specifically, Protege securely obtains patient data from multiple sources and stitches it into longitudinal, multimodal, anonymized patient-level datasets. This requires sophisticated entity resolution to match patient records across facilities without using identifiable information. A patient might have records at three different hospitals, two labs, and an imaging center, all under slightly different name spellings or with different identifiers. Protege’s algorithms match these records probabilistically while maintaining HIPAA compliance through tokenization and other privacy-preserving techniques.
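The matching idea can be sketched with keyed-hash tokens. This is a simplified illustration, not Protege’s actual algorithm: the field combinations, the integer weights, and the `SECRET_KEY` are all hypothetical. Production record linkage uses far richer models, and tokenization alone does not establish HIPAA compliance.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical demo key; a real deployment would manage keys securely.
SECRET_KEY = b"demo-only-key"

# Hypothetical weights: stricter field combinations count for more.
WEIGHTS = {"full": 5, "initial": 3, "last_zip": 2}


def normalize(value: str) -> str:
    """Lowercase and strip accents, punctuation, and whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def token(*fields: str) -> str:
    """Keyed hash of normalized fields, so raw identifiers are never shared."""
    message = "|".join(normalize(f) for f in fields).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def tokenize(record: dict) -> dict:
    """Build several tokens per record so partial agreement can still be scored."""
    return {
        "full": token(record["first"], record["last"], record["dob"]),
        "initial": token(record["first"][:1], record["last"], record["dob"]),
        "last_zip": token(record["last"], record["dob"], record["zip"]),
    }


def match_score(a: dict, b: dict) -> int:
    """Sum the weights of the tokens that agree (0 = no evidence, 10 = all agree)."""
    return sum(weight for name, weight in WEIGHTS.items() if a[name] == b[name])


hospital = {"first": "José", "last": "O'Brien", "dob": "1980-02-03", "zip": "94110"}
lab = {"first": "Jose", "last": "OBrien", "dob": "1980-02-03", "zip": "94110"}
imaging = {"first": "Joseph", "last": "O'Brien", "dob": "1980-02-03", "zip": "94110"}

same_person = match_score(tokenize(hospital), tokenize(lab))
likely_same = match_score(tokenize(hospital), tokenize(imaging))
```

Because only tokens are compared, the two sites never exchange names or birth dates. Spelling variants such as "José" and "Jose" normalize to the same token, while a looser variant such as "Joseph" still agrees on two of the three tokens and earns a partial score.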
The data itself comes in wildly different formats. EHR data arrives as HL7 messages, FHIR resources, or proprietary formats depending on the source system. Lab results use LOINC codes. Diagnoses use ICD-10. Medications use RxNorm. Imaging data lives in DICOM files. Clinical notes are unstructured text. Protege normalizes all of this into consistent schemas and data models that AI companies can actually use for training without building custom parsers for every data source.
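As a rough illustration of what normalization means in practice, the sketch below maps the same lab result from two source shapes into one schema. The `LabObservation` schema and the flat export layout are hypothetical. The FHIR input follows the standard Observation resource structure, but this is not a description of Protege’s internal data model.

```python
from dataclasses import dataclass

LOINC_SYSTEM = "http://loinc.org"


@dataclass(frozen=True)
class LabObservation:
    """Hypothetical common schema for a single lab result."""
    patient_token: str
    code_system: str
    code: str
    value: float
    unit: str


def from_fhir(resource: dict) -> LabObservation:
    """Map a FHIR Observation resource (as a parsed dict) to the common schema."""
    coding = resource["code"]["coding"][0]
    quantity = resource["valueQuantity"]
    return LabObservation(
        patient_token=resource["subject"]["reference"].split("/")[-1],
        code_system=coding["system"],
        code=coding["code"],
        value=float(quantity["value"]),
        unit=quantity["unit"],
    )


def from_flat_row(row: dict) -> LabObservation:
    """Map a hypothetical proprietary flat export row to the common schema."""
    return LabObservation(
        patient_token=row["pt_tok"],
        code_system=LOINC_SYSTEM,  # assumes this source codes results in LOINC
        code=row["loinc"],
        value=float(row["result"]),
        unit=row["uom"],
    )


fhir_obs = {
    "resourceType": "Observation",
    "subject": {"reference": "Patient/tok-123"},
    "code": {"coding": [{"system": LOINC_SYSTEM, "code": "2345-7"}]},
    "valueQuantity": {"value": 95, "unit": "mg/dL"},
}
flat_row = {"pt_tok": "tok-123", "loinc": "2345-7", "result": "95", "uom": "mg/dL"}

normalized = [from_fhir(fhir_obs), from_flat_row(flat_row)]
```

Once both sources land in the same shape, downstream training code needs one reader instead of one parser per source system.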
Quality control happens at multiple stages. Protege validates data completeness, checks for anomalies, scores data quality, and flags potential issues before delivering datasets to customers. Bad training data causes model failures that might not surface until production, so quality assurance cannot be an afterthought. The platform tracks data lineage, versions datasets, and maintains audit trails for compliance purposes.
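The checks described above might look something like the following in their simplest form. This is a sketch under stated assumptions: the required fields, the plausible range for LOINC 2345-7 (serum glucose), and the report shape are hypothetical choices, not Protege’s actual validation rules.

```python
from __future__ import annotations

# Hypothetical required fields for a lab record.
REQUIRED_FIELDS = ("patient_token", "code", "value", "unit")

# Hypothetical plausibility bounds per code (serum glucose in mg/dL).
PLAUSIBLE_RANGES = {"2345-7": (10.0, 1000.0)}


def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    present = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return present / len(REQUIRED_FIELDS)


def anomalies(record: dict) -> list[str]:
    """Describe values that fall outside the plausible range for their code."""
    issues = []
    bounds = PLAUSIBLE_RANGES.get(record.get("code"))
    value = record.get("value")
    if bounds is not None and isinstance(value, (int, float)):
        low, high = bounds
        if not low <= value <= high:
            issues.append(f"value {value} outside plausible range {low}-{high}")
    return issues


def quality_report(records: list[dict]) -> dict:
    """Summarize completeness and list the indexes of records with anomalies."""
    scores = [completeness(r) for r in records]
    flagged = [i for i, r in enumerate(records) if anomalies(r)]
    return {
        "mean_completeness": sum(scores) / len(scores) if scores else 0.0,
        "flagged_indexes": flagged,
    }


batch = [
    {"patient_token": "tok-1", "code": "2345-7", "value": 95.0, "unit": "mg/dL"},
    {"patient_token": "tok-2", "code": "2345-7", "value": 9500.0, "unit": ""},
]
report = quality_report(batch)
```

Running checks like these before delivery means a record with a missing unit or an implausible value is flagged at the source rather than discovered after a model has already trained on it.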
On the demand side, Protege serves frontier AI labs, AI application companies, and enterprises building internal AI capabilities. According to a16z’s announcement, Protege already works with the majority of the “Magnificent Seven” (MAG7) public companies plus many large private AI players. These companies use Protege to access curated datasets across healthcare, video, audio, motion capture, and other modalities without needing to negotiate hundreds of individual data partnerships.
The platform delivers data through multiple mechanisms depending on customer needs. Protege curates datasets from across its partner network to meet AI development needs, providing AI-ready data that integrates with modern ML workflows. The key value proposition is enabling AI builders to iterate quickly on model development rather than spending months or years on data acquisition and cleaning.
Beyond healthcare, Protege has expanded into other data modalities where similar problems exist. Media companies have vast archives of video and audio content that is valuable for training multimodal AI models but difficult to license at scale. Motion capture data from sports, entertainment, and research applications can train robotics and embodied AI systems. The same platform architecture that aggregates healthcare data can aggregate content libraries, with appropriate adjustments for different licensing and compliance requirements.
Why This Team Can Execute
The regulatory and compliance knowledge may be the most underrated advantage. Healthcare data is among the most heavily regulated in the world. HIPAA has complex requirements around de-identification, business associate agreements, breach notification, and audit trails. Different states have additional privacy laws. International markets have GDPR and other frameworks. May and Samuels have spent years working with healthcare lawyers, privacy officers, compliance teams, and regulators. They know what is permissible, what requires special handling, and how to structure agreements that satisfy all parties.
The engineering talent required to build this platform is also easier to recruit when the founders have successful exits and track records. Top data engineers want to work on hard problems with teams that have proven they can execute. Protege can attract senior technical talent from companies like Databricks, Snowflake, and Palantir by offering equity in a rocket ship with experienced founders who have built infrastructure companies before.
Economic Realignment
The emergence of Protege and similar data infrastructure platforms shifts economics throughout the AI stack in ways that are still playing out. For data suppliers, it creates revenue streams that did not exist before. Hospitals and health systems have long viewed patient data as a compliance burden and liability, not an asset. EMR systems cost millions to maintain, security teams exist to prevent breaches, and sharing data adds risk. But if a hospital can monetize anonymized data for AI training while maintaining full compliance, that liability becomes an asset.
For healthcare providers specifically, the economics are compelling. A mid-size hospital system sitting on ten years of EHR data, imaging, and lab results represents significant value for training diagnostic models or clinical decision support systems. Previously, accessing that value required building internal data science teams, negotiating one-off partnerships, or simply leaving money on the table. Platforms like Protege that handle acquisition, anonymization, and licensing let providers generate revenue without adding headcount or compliance risk.
The revenue potential is meaningful relative to hospital margins. Health systems operate on thin margins, often 2 to 3% for non-profit hospitals. Adding a revenue stream from data licensing, even a modest one, can move operating results noticeably. For struggling rural hospitals or safety-net providers, this could be the difference between staying open and closing.
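A quick back-of-the-envelope calculation shows why thin margins amplify even modest licensing revenue. Only the 2 to 3% margin range comes from the discussion above. The $500M revenue base, the $1M in licensing revenue, and the 80% contribution margin on that revenue are hypothetical figures chosen for illustration.

```python
def licensing_uplift(revenue: int, margin_bps: int,
                     licensing_revenue: int, licensing_margin_bps: int):
    """Return (base operating income, added income, relative uplift).

    Dollar amounts are whole dollars and margins are in basis points.
    All inputs are hypothetical.
    """
    base_income = revenue * margin_bps // 10_000
    added_income = licensing_revenue * licensing_margin_bps // 10_000
    return base_income, added_income, added_income / base_income


# Hypothetical health system: $500M revenue at a 2.5% operating margin,
# plus $1M of data licensing revenue at an assumed 80% contribution margin.
base, added, uplift = licensing_uplift(
    revenue=500_000_000,
    margin_bps=250,
    licensing_revenue=1_000_000,
    licensing_margin_bps=8_000,
)
print(f"Base operating income: ${base:,}")
print(f"Added from licensing:  ${added:,}")
print(f"Relative uplift:       {uplift:.1%}")
```

Under these assumptions, licensing revenue equal to 0.2% of total revenue lifts operating income by several percent, because the base it is measured against is so small.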
Research networks and registries face similar dynamics. Organizations that collect patient outcomes data for specific conditions or treatments have spent years building these datasets for academic research. Now they can make that data available for AI development with appropriate protections, creating funding that makes their core research mission more sustainable. Disease-specific registries, tumor boards, and clinical trial networks all sit on valuable longitudinal outcome data that AI companies desperately need.
Media companies and content owners are waking up to similar opportunities. Major studios and broadcasters have massive video and audio archives that were previously just sitting in vaults or used for limited internal purposes. Training multimodal AI models on diverse video content has enormous value for companies building computer vision, video generation, or embodied AI systems. Licensing historical content for AI training creates a new revenue stream from otherwise dormant assets.
For AI builders, the economics flip from a major cost and bottleneck to a predictable expense. Instead of hiring business development teams to negotiate dozens of hospital partnerships, burning six to twelve months on each, companies can access curated datasets through Protege in weeks. Instead of building internal data engineering teams to clean and integrate heterogeneous sources, they get normalized data ready for training. The time and cost savings are substantial, but the strategic value is even larger.
Being able to iterate quickly on model hypotheses changes product development fundamentally. If you think adding a specific type of imaging data will improve diagnostic accuracy, you can test that in weeks rather than months or years. If a model performs poorly on certain patient populations, you can quickly source additional training data to address the gap. Speed of iteration becomes a competitive advantage, and Protege enables that speed.
Pricing models will be critical for how this plays out. Traditional enterprise data deals involve lengthy negotiations, volume commitments, and opaque pricing. That works for established companies with data budgets but kills startup experimentation. If Protege can offer transparent, usage-based pricing aligned to startup economics, it enables a much broader set of AI builders to access valuable training data. This is similar to how AWS democratized infrastructure access compared to buying your own servers.
There are interesting dynamics around data exclusivity and competitive advantage. Should leading AI companies be able to license exclusive access to certain datasets? Does that create unfair advantages, or is it just normal competitive tactics? Protege needs to balance enabling competition with allowing differentiation. The likely equilibrium involves a mix of widely available datasets that level the playing field and exclusive arrangements for unique data sources, similar to how cloud infrastructure works today.
The revenue split between Protege and data suppliers also matters. If Protege takes too much margin, data suppliers will try to go direct or use competing platforms. If Protege gives away too much margin, the business will not be sustainable or profitable enough to justify a16z’s valuation expectations. The right split probably varies by data type, exclusivity, and supplier bargaining power. Large health systems have more leverage than small research networks. Unique datasets command better economics than commoditized data.
What This Means for Builders
For founders building AI companies, the implications of mature data infrastructure shift strategic priorities in several ways. Data strategy moves from being primarily a business development and operations challenge to being a product and engineering question. Instead of hiring salespeople to negotiate hospital partnerships, you hire ML engineers to evaluate dataset quality and design training pipelines. Instead of building custom ETL for each data source, you integrate with standardized APIs.
This lowers barriers to entry for new AI applications that were previously too difficult for startups to pursue. Building a diagnostic radiology model used to require years of partnerships before training the first model. Now you can get started in weeks. That opens up entire categories of healthcare AI that were previously accessible only to well-funded, experienced teams. The same pattern will play out in other verticals as Protege and similar platforms expand beyond healthcare.
Competitive dynamics shift toward model architecture, training techniques, and application-specific optimization rather than pure data access. When everyone can access similar baseline datasets, differentiation comes from what you do with the data. This is probably healthier for innovation overall, since it rewards technical capability rather than just partnership skills. Companies compete on actual AI capabilities instead of who negotiated better data deals.
For investors, data infrastructure platforms represent a different risk-return profile than typical SaaS businesses. Network effects are strong once you have critical mass on both supply and demand sides. Marginal costs for incremental data sources and customers are relatively low compared to initial platform development. Switching costs are moderate to high once AI companies integrate data pipelines into training workflows. The business model looks more like marketplace economics than traditional software.
Revenue concentration around a small number of large AI customers creates risk but also validates product-market fit. If the leading frontier model builders all use your platform, that proves the value proposition strongly. The question becomes whether you can expand beyond anchor customers to serve the long tail of AI builders. That Protege already works with the majority of MAG7 companies plus large private AI players suggests it has the anchor customers locked in. Expanding to mid-market and smaller companies will determine ultimate market size.
The broader pattern extends beyond healthcare to any domain with valuable private data. Manufacturing and industrial companies with sensor data from physical processes could enable embodied AI and robotics. Financial institutions with transaction data could train better fraud detection and risk models. Telecommunications companies with network data could improve infrastructure optimization. The playbook established in healthcare likely applies across multiple verticals, each of which could be as large as healthcare alone.
What remains uncertain is whether data infrastructure becomes a winner-take-most market or supports multiple specialized platforms. Arguments exist on both sides. Network effects and economies of scale in building supply relationships favor concentration. But vertical specialization, regional focus, and different data modalities might support multiple winners. Healthcare alone might sustain several platforms focusing on different data types or customer segments. The next few years will determine market structure.
The timing question matters significantly. Data infrastructure platforms that establish themselves now, while frontier AI labs are desperate for training data, will be sticky even as the market matures. Companies that wait risk entering a market with established incumbents and locked-up supply relationships. For entrepreneurs, the window to build in this category is open but probably measured in quarters, not years. Protege’s $65M in funding and a16z backing will accelerate their timeline and make it harder for followers to catch up.

