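OnBoard! EP 60. An All-English Conversation with a CRV Investor and the LanceDB Founder: In the Second Half of Vector Databases, What Data Infrastructure Do LLMs and Multimodal AI Need?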

Update: 2024-09-13
Description

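OnBoard! is back with another all-English interview! Last year's episode with Hanlin Tang, CTO of MosaicML (acquired by Databricks for $1.3Bn), and Casber Wang, partner at Sapphire Ventures, was very well received, so the format of a founder and an investor exploring one topic from different angles seemed well worth trying again. Monica has been looking forward to hosting these two guests for a long time!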

Hello World, who is OnBoard!?

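This time we discuss a long-running investment hotspot in Silicon Valley: data infrastructure for LLM applications. How has the competitive landscape of vector databases (vectorDB), which were just taking off last year, evolved since then? What challenges and opportunities does the growth of multimodal data in AI applications bring to data infra?

These two guests, both working on the front lines in Silicon Valley, are ideally placed to dig into the topic: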

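Founder guest: Chang She, Co-founder & CEO of LanceDB. LanceDB is an open-source vector database designed for multimodal data. Chang is a data infra veteran: he is one of the core contributors to the well-known pandas library, and DataPad, the company he founded, was acquired by Cloudera a few years ago. In 2022, Chang set out on his second startup journey and founded LanceDB.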

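VC guest: Brian Zhan, an investor at CRV, a top-tier early-stage fund in Silicon Valley with a 50-year history. CRV's latest fund exceeds $1.5Bn, and the startups it has backed include DoorDash, Airtable, and Vercel. Brian was a data infra product manager at Meta and later joined the open-source database unicorn Starburst. He is one of the few infra investors with both a technical and a product background!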

Brian led LanceDB's $8M seed round at the end of 2023, and LanceDB has raised more than $11M in total to date. LanceDB's users now include a number of leading GenAI companies, such as Character.ai, Midjourney, and Harvey.

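We also talked at length about what Chang has learned as a serial entrepreneur, along with both guests' sharp takes on open-source commercialization and hot topics in data infra. The chemistry between the two is great fun, too. Enjoy!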

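Guest introductions

  • Chang She (Twitter @changhiskhan): Co-founder & CEO @LanceDB. Formerly VP Engineering at Tubi. DataPad, which he founded in 2013, was acquired by Cloudera. Core contributor to the pandas library.
  • Brian Zhan (Twitter @brianzhan1): Investor @CRV. Before joining CRV, he was a Presto product manager at Meta and Starburst.
  • OnBoard! host: Monica. USD VC investor, formerly with the AWS Silicon Valley team and an AI startup; runs the WeChat official account M小姐研习录 (ID: MissMStudy) | Jike: 莫妮卡同学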

What we talked about

02:15 Speakers' self-intro, which data infra project Chang found interesting

05:20 Why CRV invested in LanceDB

07:50 Why Chang started LanceDB, and why customers use Lance and LanceDB

18:36 Investor's view on vectorDB: how does LanceDB stand out from the competition? Why does it have the potential to become a platform?

27:47 Will vectorDBs converge? How should we think about competition from incumbent databases, such as pgvector for Postgres?

32:57 Takeaways from the announcements from Databricks and Snowflake summits in June 2024

36:15 When do we need a new data format? Why is open source important for a data format?

43:14 How will AI change the data infra landscape? What will stay, what will be replaced, and what will emerge?

52:31 Why does Chang think that RAG is similar to recommendation systems?

55:34 How to evaluate if a new opportunity is for incumbents or startups?

57:57 What are some common mistakes in building data infra? Why does Chang think that open source should not be the default mode?

60:05 How to view OpenAI's acquisition of Rockset?

74:14 Are RAG systems here to stay?

79:11 Chang's lessons as a second-time founder, and his advice to technical founders

87:04 Brian: What early investors look for in early stage startups

90:47 What do the speakers find exciting about AI in the next 1-3 years? AI agents, healthcare, robotics, multimodal (voice, video gen)

99:36 Quick-fire questions: book recommendations, what's underrated and overrated, oat milk and pressure relief

What we mentioned

  • LanceDB: An open-source vector database designed for multi-modal data.
  • Lance format: The open-source columnar data format that LanceDB is built on.
  • pandas: A popular Python library for data analysis and manipulation.
  • HDFS: The Hadoop Distributed File System, a scalable storage system for large datasets.
  • Cloudera: A leading provider of enterprise data cloud solutions.
  • DataFusion: Apache DataFusion, an open-source query engine written in Rust.
  • Presto: A distributed SQL query engine for big data analytics.
  • Parquet: A columnar storage format that is efficient for data analysis.
  • Postgres: A powerful, open-source relational database management system.
  • pgvector: An extension for PostgreSQL that adds vector storage and similarity search.
  • Unity Catalog: Databricks' centralized catalog for data discovery and governance.
  • Prefect: An open-source workflow orchestration platform for data engineering pipelines.
  • DAGWorks: A startup building tooling for defining and managing data pipelines.
  • Airflow: A popular open-source platform for programmatically authoring, scheduling, and monitoring workflows.
  • Voyage AI: A startup that builds embedding models and rerankers for search and retrieval.
  • Reflection AI: A startup building autonomous AI agents.
  • Decagon AI: A startup that builds AI agents for customer support.
  • Rockset: A real-time analytics database built for the cloud.
  • RocksDB: A high-performance embedded key-value store.

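Reference articles

Follow M小姐's WeChat official account for more in-depth content on software, AI, and venture investing in China and the US!

M小姐研习录 (ID: MissMStudy)

Feel free to leave your thoughts in the comments and interact with other listeners. If you like OnBoard!, you can also tip us to buy us a coffee! If you listen on Apple Podcasts, please give us a five-star rating; it matters a lot to us.

Finally, come join the OnBoard! listener group to meet other high-quality listeners. We also organize offline themed meetups, open up live podcast recordings for listening in, and try new things such as guest interactions. Add either assistant on WeChat, onboard666 or Nine_tunes, and they will add you to the group. We look forward to seeing you!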
