On 20 July 2023, Max Lorenz, a technical expert at an AI startup and a co-organizer of FOSSG, gave a keynote at the Midsummer Open Source Night, presenting his thoughts on AI and open source technologies. Here is what he said.
Video of Max's talk:
Max originally hails from Germany and has been in Singapore since 2018. Now happily married and an ardent supporter of open source, he works at an AI startup specializing in LLMs (Large Language Models) and is part of the FOSSG organizing team.
Full text of the speech
I'm Max. I've been working in Singapore for five years on AI products, and some time ago I was thinking: why don't I just start my own startup right now? It's the perfect time. There seems to be so much more hype around AI products compared to five years ago. Now you can actually pitch them to companies, and people are interested to hear about them. So basically I don't want to do the fancy part, the one that is easier to sell, which is the demo part.
I want to tell you what happened after I demoed and decided to turn something into a product, because I think this is a topic that not everyone touches on. If it works in LangChain with one or two documents, great! But if you put it into production, it doesn't always work exactly like the magic you expected. So, the first thing that I noticed:
I basically just followed some of the tutorials. I had been using GPT-3 for one and a half years before, so I kind of knew what I was getting into when I pitched my ideas. I was like: see, this can write emails for you, you can upload your PDF, everything that we've seen in the first talk. But once I uploaded something like 100 PDFs, semantic search didn't perform how I expected it to, and so the people using it were very confused. They were like: in the demo it looked so good.
Why is it not working with my data? Is there something wrong? We know the customer what you can do.
The second part is that there were so many edge cases. It didn't matter how intricate my GPT-4 prompt was; in the end it was never up to the task. There were always more edge cases where it didn't work. So one thing was clear: maybe one prompt is not enough, maybe one search is not enough, especially with LangChain.
My next issue was that it was very slow. To this day I use LangChain all the time, but mostly for demo purposes. If I put something into production, where you really care about how fast and how accurately the product works, I usually just start writing stuff from scratch.
So the first issue that we encountered as a startup was: now that we have all the data, how do we store it? Do we just use the typical vector database that everyone seems to be raving about these days? They certainly have great technology, but we needed a more sophisticated approach with a lot more control. The first thing that we tried was: maybe we can fine-tune GPT. Of course, that's not exactly how it works, because you have so many new documents coming in all the time, and even fine-tuning doesn't retain all the information. So we need to store the data. There are so many databases to choose from, and what we figured out very quickly is that companies don't care about documents in isolation. They don't want to find a sentence or a snippet in a document all the time, except for a few select use cases. Companies usually care about things like: we have assets,
we have companies that we talk to, we have competitors, and they want to know that your product can handle these to the best of its abilities. So you need to find a way to store them separately, search them separately, and handle them differently.
So how do we get the data right, now that we know we probably have more than one database? I'm going to show you later, but we're using Postgres, where you can combine a traditional, normal database with vector search.
How do we feed the data in? Many companies have 1,000 PDF files, Word documents, Excel sheets and images (shout-out to the first talk on LLMs). They have so many images with descriptions and titles. What do people really care about when they search: should the title match, should the text match, should the image match, or a mix of these? So first of all you need to figure out which is the best embedding model that you can use for all your different modalities.
How can you use traditional search engines, normal BM25? I'll show that on the next slide. How do you extract the metadata from your data, from your documents, for example company names if you're using CRMs, and so on? Long story short, we've just been using Postgres for now, which has an amazing open source community behind it, with different plugins as well.
pgvector is great if you want to do vector search over embeddings, and we have keyword search that works well. If you're looking for embedding models, there are a couple of recent papers; E5 from Microsoft is really, really good. If you're still using GPT for your embeddings, the Ada embeddings: for our use cases, E5 outperformed them by a huge margin. If you have enough data, thousands of documents and above, fine-tuning the embedding model actually yields huge returns. Semantic search tends to fall short when people know exactly what they're looking for and they're using Google-style queries. "I just want this one piece of information": then semantic search is not ideal, which is why we also combine it with traditional keyword search. BM25 works really well and is easy to improve with recency and sources, since most people care more about new information.
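The combination described above (semantic similarity plus BM25 keyword scoring, boosted by recency) can be sketched in plain Python. This is only an illustration under stated assumptions, not the speaker's implementation: the blending weight `alpha`, the `half_life_days` decay, the whitespace tokenizer, and the tiny hand-written vectors are all hypothetical stand-ins for a real embedding model and a tuned scoring function.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would do proper analysis.
    return text.lower().split()


def bm25_scores(query, texts, k1=1.5, b=0.75):
    """Okapi BM25 score of every text for the query (non-negative IDF variant)."""
    tokenized = [tokenize(t) for t in texts]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_rank(query, query_vec, docs, alpha=0.5, half_life_days=180.0):
    """Rank docs (dicts with 'text', 'vec', 'age_days'); returns indices, best first.

    alpha and half_life_days are assumed tuning knobs, not values from the talk.
    """
    keyword = bm25_scores(query, [d["text"] for d in docs])
    top = max(keyword) or 1.0  # avoid dividing by zero when nothing matches
    ranked = []
    for i, d in enumerate(docs):
        semantic = cosine(query_vec, d["vec"])
        recency = 0.5 ** (d["age_days"] / half_life_days)  # halves every half-life
        blended = alpha * semantic + (1 - alpha) * keyword[i] / top
        ranked.append((blended * recency, i))
    return [i for _, i in sorted(ranked, reverse=True)]


docs = [
    {"text": "pricing update for competitor x", "vec": [0.9, 0.1], "age_days": 10},
    {"text": "pricing update for competitor x", "vec": [0.9, 0.1], "age_days": 400},
    {"text": "office lunch menu", "vec": [0.1, 0.9], "age_days": 1},
]
print(hybrid_rank("competitor pricing", [1.0, 0.0], docs))
```

Multiplying by the recency factor means an older copy of an otherwise identical document always ranks below the newer one, which matches the preference for recent information mentioned above.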
If you have a huge Confluence space with hundreds of pages, putting a higher value on a more recent edit or document is great. There are also re-ranking models, which are expensive, but basically, if you take the top hundred search results and use another AI model to re-rank them by how well each snippet answers the user's question, that improves results as well. That covers the whole search part. The next question is what to do now that we have all the documents.
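The re-ranking step can be sketched as a second pass over the first-stage results. In this minimal sketch the `scorer` argument is where a real re-ranking model would plug in; `overlap_scorer` is a deliberately crude, hypothetical stand-in (word overlap) so the example runs without any model, and the `top_k` and `keep` defaults are assumptions.

```python
from typing import Callable, List


def rerank(question: str,
           candidates: List[str],
           scorer: Callable[[str, str], float],
           top_k: int = 100,
           keep: int = 5) -> List[str]:
    """Re-score the first top_k candidates with a costlier scorer; keep the best."""
    head = candidates[:top_k]  # only pay the expensive scorer for the head of the list
    ordered = sorted(head, key=lambda snippet: scorer(question, snippet), reverse=True)
    return ordered[:keep]


def overlap_scorer(question: str, snippet: str) -> float:
    """Toy stand-in for a re-ranking model: share of question words found in the snippet."""
    q = set(question.lower().split())
    s = set(snippet.lower().split())
    return len(q & s) / len(q) if q else 0.0


results = [
    "lunch menu for friday",
    "refund policy for enterprise customers",
    "enterprise pricing",
]
print(rerank("what is the refund policy", results, overlap_scorer, keep=1))
```

Because the scorer only sees `top_k` snippets, its cost stays bounded no matter how large the underlying index is.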
Let's say a company has assets, and people query for these things. We also use knowledge graphs, so basically we store every person, every entity, every asset in our database with a vector. Once we ingest a document, we store these, and when people ask for something, we give them the ability to scope their search. If they say, "I want to know about competitor X", then we can limit our documents to the ones where competitor X is actually mentioned.
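Scoping a search to an entity can be sketched as an index from entity names to the documents that mention them. This is a simplified illustration, not the system from the talk: the `EntityIndex` class is hypothetical, entity detection is assumed to be plain substring matching against a known list, and the per-entity vectors mentioned above are left out.

```python
from collections import defaultdict


class EntityIndex:
    """Maps each known entity to the ids of documents that mention it."""

    def __init__(self):
        self._docs_by_entity = defaultdict(set)
        self._texts = {}

    def ingest(self, doc_id, text, known_entities):
        # Assumption: a case-insensitive substring match stands in for real entity extraction.
        self._texts[doc_id] = text
        lowered = text.lower()
        for entity in known_entities:
            if entity.lower() in lowered:
                self._docs_by_entity[entity.lower()].add(doc_id)

    def scope(self, entity):
        """Ids of documents mentioning the entity; search only these afterwards."""
        return sorted(self._docs_by_entity.get(entity.lower(), set()))


index = EntityIndex()
entities = ["Acme", "Globex"]
index.ingest(1, "Acme raised prices in Q2", entities)
index.ingest(2, "Globex launched a new product", entities)
index.ingest(3, "Meeting notes: acme and Globex comparison", entities)
print(index.scope("Acme"))
```

The scoped id list would then be passed as a filter to whatever keyword or vector search runs next.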
All right, now the next, or the last, step. Now you have amazing search, right, and based on the search results, and based on some other stuff, you can run different prompts.
How do you put it into production? How do you make sure you catch all the error cases? How do you make sure it gets better over time?
First thing: we use a lot of classification. People ask for different things, sometimes they just want to chat, sometimes other stuff, and you want to be able to categorize that. If you ask GPT to do this out of the box, it usually means that the accuracy is not high and it is very slow. So we usually build our own models.
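A small, fast classifier for routing requests can be sketched with a multinomial Naive Bayes model over words. The talk does not say which kind of model was built, so this is purely an illustrative stand-in; the intent labels (`chat`, `search`, `draft`) and the training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesIntent:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data; in practice this could be generated or labeled at scale.
training = [
    ("hi there how are you", "chat"),
    ("thanks that was helpful", "chat"),
    ("find the latest contract for acme", "search"),
    ("show documents about competitor pricing", "search"),
    ("write an email to the client", "draft"),
    ("draft a reply to this message", "draft"),
]
model = NaiveBayesIntent().fit(training)
print(model.predict("find documents about acme"))
```

A model like this answers in microseconds, which is the latency argument made above for not sending every routing decision to a large model.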
There are a couple of open source projects that use GPT-4 to generate artificial training data. Autolabel is a great product; I can recommend it, a great open source tool. The next thing is that you want to call APIs. If you have a CRM and you want to get the latest version of an asset, you should use function-calling models like the GPT ones that were announced a couple of weeks back. If you just ask GPT to return JSON, it sometimes hallucinates fields, whereas the dedicated models are restricted to only output "tokens that are valid for that specific JSON call", so this makes it really easy to integrate new APIs. The last thing, and for me the game changer, is monitoring. You always want to record everything that's going on, and you want to use typical data science methods. You want to capture: how is it received by the user? Can we get any metrics to see how they are using the product? Sometimes just a thumbs-up and thumbs-down button is good enough to capture trends.
Weights & Biases has a great integration for LangChain where you can capture every single step: what is the agent doing? Here you can see you get all the settings from GPT-4. So if someone changes the GPT-4 settings, you want to record that, and you want to see whether some metric increases or decreases over time. That's it.
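Two of the points above, catching hallucinated JSON fields before calling an API and tracking thumbs-up/down feedback per model configuration, can be sketched with the standard library alone. Everything named here is an assumption for illustration: the `ALLOWED_FIELDS` schema, the `parse_call` helper, and the `FeedbackLog` class are hypothetical and do not represent any particular vendor's or tool's API.

```python
import json
from collections import defaultdict

# Hypothetical schema for a "get latest asset version" call.
ALLOWED_FIELDS = {"asset_id": str, "version": int}


def parse_call(raw):
    """Parse model output as JSON; raise ValueError on unknown, missing, or mistyped fields."""
    args = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(args, dict):
        raise ValueError("expected a JSON object")
    unknown = set(args) - set(ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    for name, expected in ALLOWED_FIELDS.items():
        if not isinstance(args.get(name), expected):
            raise ValueError(f"missing or mistyped field: {name}")
    return args


class FeedbackLog:
    """Groups thumbs-up/down votes by the exact settings that produced the answer."""

    def __init__(self):
        self._votes = defaultdict(list)

    def record(self, settings, thumbs_up):
        key = json.dumps(settings, sort_keys=True)  # stable key for a settings dict
        self._votes[key].append(1 if thumbs_up else 0)

    def approval_rate(self, settings):
        votes = self._votes.get(json.dumps(settings, sort_keys=True), [])
        return sum(votes) / len(votes) if votes else None


log = FeedbackLog()
old = {"model": "gpt-4", "temperature": 0.7}
new = {"model": "gpt-4", "temperature": 0.2}
log.record(old, True)
log.record(old, False)
log.record(new, True)
print(log.approval_rate(old), log.approval_rate(new))
```

Keying the votes by the serialized settings makes it possible to compare approval before and after someone changes a setting, which is the trend the paragraph above says you want to watch.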
Thank you very much!
Group photo
Partners
Scan the code to get the complete PPT
About KCC Singapore
KCC Singapore, founded on July 20, 2023, is the first step in the open source community's global strategy. Our mission is to empower developers to embrace and contribute to open source. Through partnerships with universities, tech companies, and government departments, we aim to promote open source adoption in Singapore's digital economy. Working closely with local open source communities and forging global connections, we amplify the voice of Chinese open source. Together, we empower open source for a brighter digital future.
Author | KCC@Singapore
Editor | 翁培培
Related Reading
Where history meets today: the KCC@Hangzhou Meetup concludes successfully
Kaiyuanshe KCC@Singapore has been established!
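About Kaiyuanshe
Founded in 2014, Kaiyuanshe (开源社) is made up of individual members who volunteer their contributions to the open source cause, organized on the principles of "contribution, consensus, and co-governance". It has always remained vendor-neutral, public-interest, and non-profit, and is the earliest open source community alliance whose mission is "open source governance, international alignment, community development, and project incubation". Kaiyuanshe works closely with communities, enterprises, and government bodies that support open source. With the vision of "rooted in China, contributing to the world", it aims to co-create a healthy and sustainable open source ecosystem and to help China's open source community become an active participant in and contributor to the global open source system.
In 2017, Kaiyuanshe became an organization composed entirely of individual members, operating on the governance model of top international open source foundations such as the ASF. Over nearly nine years, it has connected tens of thousands of open source practitioners, gathered more than a thousand community members and volunteers and hundreds of speakers from China and abroad, and worked with hundreds of sponsors, media, and community partners.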