Summary
In a surprising statement, Yann LeCun, considered a godfather of AI, declared he is no longer interested in Large Language Models (LLMs), describing them as a maturing technology being refined at the margins while other questions in AI remain open. Speaking at Nvidia's GTC 2025, LeCun stressed the importance of understanding the physical world, creating persistent memory, and improving planning and reasoning in AI. He advocated for world models over LLMs for achieving AGI, suggesting that AI needs new architectures beyond token-based systems to understand and navigate the real world effectively. His views underscore the complexity of developing truly intelligent machines and highlight ongoing efforts to introduce alternative models like V-JEPA for future AI development.
Highlights
Yann LeCun dismisses LLMs as no longer exciting, focusing instead on real-world understanding.
LeCun favors world models to overcome the limitations of current AI systems.
He emphasizes new architectures for AI that are more aligned with human-like cognition.
Meta's emerging V-JEPA model shows promise with its novel approach to video learning.
Despite LLMs' popularity, they are insufficient for achieving Artificial General Intelligence.
Key Takeaways
Yann LeCun, AI pioneer, claims LLMs are becoming a mature, incremental technology and are less exciting to him.
Focus is shifting to understanding the physical world, memory, reasoning, and planning in AI.
LeCun supports world models over LLMs for achieving advanced AI goals.
Current LLMs may not take us to AGI; new architectures are needed.
Meta's V-JEPA model could be a significant step forward in AI development.
Overview
Yann LeCun, one of AI's pioneering figures, recently shook the industry with his statement at Nvidia GTC 2025: he is done with Large Language Models (LLMs). Highlighting their limitations, LeCun described a shift toward new questions in AI, particularly how machines can better understand and interact with the physical world and develop reasoning and planning abilities akin to humans. His comments point to a paradigm shift he believes is necessary for advancing AI beyond its current capabilities.
Throughout the discussion, LeCun delved into the inadequacies of token-based systems for dealing with high-dimensional, continuous data from the real world. Instead, he proposed world models as a promising path toward true AGI (Artificial General Intelligence). In his view, systems like Meta's V-JEPA are structured in a manner that aligns more closely with the way humans think and learn, showing potential to predict and understand complex, real-world scenarios.
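To make that contrast concrete, here is a minimal, hypothetical sketch in Python (PyTorch-style) of the two training objectives being compared: reconstructing raw pixels versus predicting in an abstract representation space, the joint-embedding idea behind V-JEPA. The module sizes, the masking scheme, and names such as `Encoder` and `predictor` are illustrative assumptions, not Meta's actual architecture or code.

```python
# Minimal sketch: pixel-space reconstruction vs. joint-embedding prediction.
# All shapes, module sizes, and the masking scheme are illustrative assumptions,
# not the actual V-JEPA implementation.
import torch
import torch.nn as nn

D = 256  # embedding dimension (assumed)

class Encoder(nn.Module):
    """Maps a flattened video clip to an abstract representation."""
    def __init__(self, in_dim: int, d: int = D):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, d))

    def forward(self, x):
        return self.net(x)

in_dim = 16 * 32 * 32                      # e.g. 16 frames of 32x32 grayscale, flattened (assumed)
context_encoder = Encoder(in_dim)          # sees only the masked clip
target_encoder = Encoder(in_dim)           # sees the full clip (often a momentum copy in practice)
predictor = nn.Linear(D, D)                # predicts the target embedding from the context embedding

clip = torch.randn(8, in_dim)              # small batch of full clips (random stand-in data)
masked_clip = clip * (torch.rand_like(clip) > 0.5)  # crude random masking (assumed)

# Generative / pixel-level objective: try to reconstruct every pixel, including
# detail that is inherently unpredictable -- the waste LeCun describes.
pixel_decoder = nn.Linear(D, in_dim)
pixel_loss = ((pixel_decoder(context_encoder(masked_clip)) - clip) ** 2).mean()

# Joint-embedding objective: predict only the abstract representation of the
# full clip, so unpredictable pixel detail can simply be discarded.
with torch.no_grad():
    target_repr = target_encoder(clip)     # stop-gradient on the target side
pred_repr = predictor(context_encoder(masked_clip))
jepa_loss = ((pred_repr - target_repr) ** 2).mean()

print(float(pixel_loss), float(jepa_loss))
```

As LeCun notes later in the transcript, the same representation-space prediction error can also serve as a plausibility score: clips where something physically odd happens produce unusually large errors.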
In summing up his stance, LeCun emphasized the need for AI to transcend its traditional learning models. He argued that contemporary architectures might be limiting AI rather than propelling it toward AGI. His call for innovative frameworks reflects a broader industry push toward more comprehensive AI systems, capable of reasoning in abstract spaces much like a human brain and no longer reliant solely on vast amounts of text data.
Chapters
00:00 - 00:30: Introduction and Context The chapter introduces a statement that may surprise some: Yann LeCun's declining interest in LLMs, expressed at Nvidia GTC 2025. Despite the hype around LLMs, the text focuses on LeCun's shifting perspective, positioning it as significant within the artificial intelligence discourse and highlighting his esteemed status as a 'godfather of AI.'
00:30 - 01:00: Yann LeCun's Statement on LLMs The chapter discusses a statement made by Yann LeCun, an expert with extensive experience in AI research, particularly in relation to the capabilities of large language models (LLMs). The significance of his statement is highlighted, sparking curiosity about whether his perspective is accurate. The chapter also sets up the conversation about the most exciting developments of the past year, which LeCun redirects toward four main questions he considers more interesting.
01:00 - 01:30: Challenges with LLMs and Physical World Understanding This chapter discusses some challenges with Large Language Models (LLMs), particularly their limited understanding of the physical world. The speaker describes a shift in interest away from LLMs: while industry teams continue to improve them incrementally with more data and compute, he sees more intriguing questions beyond the current focus on LLMs, in areas where LLMs fall short and more sophisticated advances are needed.
01:30 - 02:00: World Models and Abstract Representations In this chapter, the discussion revolves around four main aspects of machine intelligence: understanding the physical world, establishing persistent memory, reasoning, and planning. Jensen, in a keynote, touched upon the challenge of teaching machines to comprehend the physical world. Additionally, the concept of machines having persistent memory is highlighted as a relatively less explored area. Moreover, the chapter critiques the current simplifications in how large language models (LLMs) approach reasoning, suggesting that there are more complex dimensions to consider in developing machine reasoning and planning capabilities.
02:00 - 02:30: Limitations of Current Architectures This chapter focuses on the limitations of current architectures in AI, specifically discussing the inadequacy of text as the sole world model for advanced AI systems. It highlights the excitement and potential in exploring alternative and more comprehensive world models that are currently considered obscure academic topics but may become pivotal in the tech community in the future.
02:30 - 03:00: VJEPA and Efficient Representation Learning This chapter addresses the limitations of next-token prediction on the road to artificial general intelligence (AGI), emphasizing its inadequacy for physical-world applications. It asks what the underlying model should be, if not an LLM, for reasoning, persistent memory, and planning, and introduces world models as the focal point of research for overcoming these challenges.
03:00 - 03:30: Training on Video Data and Predictive Capability This chapter discusses the concept of world models within our minds, which enable us to manipulate thoughts and predict outcomes based on our understanding of the physical world. It uses the example of a bottle and how different interactions like pushing might make it flip, slide, or pop, to explain how these mental models are formed in the early months of life.
03:30 - 04:00: System 1 and System 2 Thinking in AI This chapter discusses the different architectural needs of systems designed to act in the real world versus those dealing solely with language. Current systems predict and manipulate tokens, whether in language models or in autonomous-vehicle pipelines that turn sensor tokens into driving tokens, but engaging with the physical world is far more complex, suggesting a significant distinction in system design between linguistic processing and real-world interaction.
04:00 - 04:30: Future of AI and Hybrid Models The chapter delves into the limitations of using tokens to represent the physical world, since tokens are discrete and finite in number. In a large language model (LLM), the number of possible tokens is typically around 100,000, so the model cannot predict the exact next token; instead, it is trained to produce a probability distribution over the possible next tokens in a sequence (a short code sketch after this chapter list illustrates the idea).
04:30 - 05:00: Data Requirements and Limitations of Text-Based Training This chapter discusses the challenges of handling high-dimensional and continuous natural data in text-based training. The central problem is the inability to effectively build systems that understand the world by predicting videos at the pixel level. This task involves processing and predicting high-dimensional data, which current methods struggle to accomplish.
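As a concrete illustration of the token point in the 04:00 - 04:30 chapter above, here is a small, self-contained Python sketch of what "a probability distribution over the possible next tokens" looks like. The vocabulary size is the order of magnitude LeCun cites; the random logits and the `softmax` helper are made-up stand-ins for a real model's output, not any particular LLM's code.

```python
# Sketch of what "predicting the next token" means: the model cannot name the
# exact next token, but it can output a probability distribution over a finite
# vocabulary. Vocabulary size and logits here are illustrative assumptions.
import numpy as np

VOCAB_SIZE = 100_000  # order of magnitude LeCun cites for a typical LLM

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that lie in [0, 1] and sum to 1."""
    z = logits - logits.max()   # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=VOCAB_SIZE)   # stand-in for the model's output scores
probs = softmax(logits)

print(probs.shape)    # (100000,) -- "a long vector of 100 thousand numbers"
print(probs.sum())    # 1.0 (up to floating point)
print(probs.argmax()) # index of the single most likely next token

# This recipe works because the set of tokens is discrete and finite.
# A video frame, by contrast, is a point in a continuous, very high-dimensional
# space, so there is no finite dictionary to normalize a distribution over --
# the difficulty raised in the 04:30 - 05:00 chapter.
```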
Meta's AI Boss Says He's DONE With LLMs... Transcription
00:00 - 00:30 But I tell you one thing which may surprise a few of you. Um, I'm not so interested in LLMs anymore. That was one of the statements that Yann LeCun made at Nvidia GTC 2025. And I think this clip has done the numbers on Twitter because honestly, we do know that right now in the AI space, LLMs definitely are receiving most of the hype when it comes to AI. Now, of course, if you aren't familiar with who Yann LeCun is, he is actually one of the godfathers of AI
00:30 - 01:00 research, and he's actually been in the AI space for quite some time. This isn't someone's first rodeo when they're making statements like this. It's based on years and years of knowledge and expertise in that field. So, when someone of this level of expertise says a statement like this, it leaves a lot of people wondering, is he actually right? Take a listen to the further conversation where he talks about the four main focuses. So, um, Yan, there's been a lot of interesting things going on in the last year in AI. What has been the most exciting development in in your
01:00 - 01:30 opinion over the past year? Uh, too many to count, but I tell you one thing which may surprise a few of you. Um, I'm not so interested in LLMs anymore. you know they're kind of the last thing they are in the hands of you know industry product people kind of you know improving at the margin uh trying to get you know more data more compute generating synthetic data um I think there are more interesting questions in
01:30 - 02:00 uh four four things how you get machines to understand the physical world and Jensen talked about this this morning in this keynote how do you get get them to have persistent memory which not too many people talk about and then the last two are how do you get them to reason and plan and there is some effort of course to get you know LLMs to reason but in my opinion it's a very kind of simplistic way of uh viewing um viewing reasoning I think there are probably
02:00 - 02:30 kind of more, you know, better ways of doing this. So, um, so I'm excited about things that a lot of people in this community, in the tech community, might get excited about five years from now. Um, but right now it doesn't look so exciting because it's some obscure academic paper. So now this is where Yann LeCun actually talks about world models and the fact that text being your only world model isn't sufficient to get you to
02:30 - 03:00 AGI. He says that next token prediction is basically something that yes, it kind of works well for text, but it doesn't really work that well when it comes to actually doing things in the real physical world that humans do. But if it's not an LLM that's reasoning about the physical world and having persistent memory and planning, what is it? What is the underlying model going to be? Um, so a lot of people are working on world models, right? So what is a world model? World model is we all
03:00 - 03:30 have world models in our mind. This is what allows us to, um, kind of, you know, manipulate thoughts essentially. So, you know, we have a model of the current world. You know that if I push on this bottle here from the top it's probably going to flip, but if I push on it at the bottom it's going to slide. Um, and you know, if I press on it too hard it might pop. So we have models of the physical world that we acquire in the first few months
03:30 - 04:00 of life and that's what allows us to deal with the real world and it's much more difficult to deal with the real world than to deal with language and so the the type of architectures that I think we need for systems that really can deal with the real world is completely different from the ones that we deal with at the moment right predict tokens right but tokens could be anything I mean so our you know autonomous vehicle model uses tokens tokens from the sensor and it produces tokens that drive and in some sense it's reasoning about the physical world at least where it's safe to drive and you
04:00 - 04:30 won't run into poles. Um, why aren't tokens the right way to represent the physical world? Tokens are discrete. Okay. So when we talk about tokens, generally we talk about, uh, a finite set of possibilities. In a typical LLM the number of possible tokens is on the order of 100 thousand or something like that, right? Um, so when you train a system to predict tokens, you can never train it to predict the exact token that's going to follow a sequence in text, for example, but you can produce a probability distribution of all the
04:30 - 05:00 possible tokens in your dictionary. You know, it's just a long vector of 100 thousand numbers between zero and one that sum to one. We know how to do this. We don't know how to do this with, you know, natural data that is high-dimensional and continuous, and every attempt at trying to get a system to understand the world or build mental models of the world by being trained to predict videos at the pixel level
05:00 - 05:30 has basically failed. Um, even to train a system like a neural net of some kind to learn good representations of images, every technique that works by reconstructing an image from a corrupted or transformed version of it has basically failed. Not completely failed: they kind of work, but they don't work as well as alternative architectures that we call joint embedding, which essentially don't attempt to reconstruct at the pixel level. They try to learn a
05:30 - 06:00 representation, an abstract representation of the image or the video or the natural, uh, signal that it is being trained on, so that you can make predictions in that abstract representation space. Um, the example I use very often is that if I take a video of this room and I kind of pan a camera and I stop here and I ask the system to predict, you know, what's the continuation of that video, it's probably going to predict it's a room and there's people sitting, blah blah blah. There's no way it can predict what every single one of you looks like, right? That's
06:00 - 06:30 completely unpredictable from the initial segment of the video. And so there's a lot of things in the world that are just not predictable. And if you train a system to predict at a pixel level, it spends all of its resources trying to come up with details that it just cannot invent. And so that's just a complete waste of resources. And every attempt that we've tried, and I've been working on this for 20 years, uh, of training a system using self-supervised learning by predicting video doesn't work. It only works if you do it
06:30 - 07:00 at a representation level. And what that means is that those architectures are not generative. So you're basically saying that a transformer doesn't... So basically what he's stating here is that, you know, using a transformer to basically predict the physical world just doesn't work because of the architecture. And he actually does make some key points. You know, if you're just predicting the next token, there are a lot of things that are just implicit from your understanding of the physical world, from you being there, and all of this reasoning that goes on in your brain that you really just take for granted. So, I do think he does make a
07:00 - 07:30 good point there that it doesn't really work. Now, of course, like I said before, I'm not just agreeing with what Yann LeCun says here. There have been research papers; I've done a video on one where, you know, researchers in China did a lot of research on Sora and basically said that these kinds of architectures don't really predict the physical world. In fact, I can't remember exactly what they did, but in the paper they essentially argued that these video models aren't really predicting the physical world; it's more sort of mimicking the world based on the architecture. And it was really interesting to see that deep dive. Honestly, if you watch the video, you'll understand it a little bit more
07:30 - 08:00 in depth. Of course, it's very easy to say, okay, this doesn't work, that doesn't work. But now we have to get to the crux of the video, which is: all right, we've got the knowledge that it doesn't work, but what is the solution? And this is where Yann LeCun talks about his famous V-JEPA architecture. And apparently, they're actually coming out with version two very soon, and this one is probably showing the most promising results out of any model so far. You know, Jensen is absolutely right that, uh, you get ultimately more power in a system that can sort of, you know,
08:00 - 08:30 reason. I disagree with the idea that the proper way to do reasoning is the way, you know, current LLMs that are augmented by reasoning abilities do it. You're saying it works but it's not the right way. It's not the right way. I think, uh, you know, when we reason, when we think, we do this in some sort of abstract mental, uh, state that has nothing to do with language. You don't like kicking the tokens out. You want to be reasoning in your, um, latent space, in abstract space, right? I mean, if I tell you, you know, imagine a cube floating in front of you, and now rotate
08:30 - 09:00 that cube by 90 degrees around a vertical axis. Okay, you can do this mentally. It has nothing to do with language. Um you know a cat could do this. uh we can't specify the problem to a cat obviously through language but you know cats do things that are much more complex than this when they plan like you know some trajectories to jump on a piece of furniture right they they do things that are much more complex than that and um that is not related to language it's certainly not done in so you know token space which would be kind of actions it's done in sort of abstract
09:00 - 09:30 mental space. So that's, uh, that's kind of the challenge of the next few years, um, which is to figure out new architectures that allow this type of thing. That's what I've been working on for the last... So, is there a new model we should be expecting that allows us to do reasoning in this abstract space? Uh, it's called, we call it JEPA, uh, or JEPA world models, and, you know, um, my colleagues and I have put out a bunch of, uh, papers on this, kind of, you know, first steps
09:30 - 10:00 towards this over the last few years. So JEPA means joint embedding predictive architecture. These are those world models that learn abstract representations and are capable of sort of manipulating those representations, uh, and perhaps reasoning, and producing sequences of actions to, you know, arrive at a particular goal. I think that's the future. I wrote a long paper about this that explains, uh, how this might work about three years ago. Let's actually take a look at what that V-JEPA architecture actually looks like, from the video Meta released last year.
10:00 - 10:30 Today machines require thousands of examples and hours of training to learn a single concept. The goal with JEPAs, which means joint embedding predictive architectures, is to create highly intelligent machines that can learn as efficiently as humans. V-JEPA is pre-trained on video data, allowing it to efficiently learn concepts about the physical world, similar to how a baby learns by observing its parents. It's able to learn new concepts and solve new tasks using only a few examples without
10:30 - 11:00 full fine-tuning. V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. Unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard irrelevant information, which leads to more efficient training. To allow our fellow researchers to build upon this work, we're publicly releasing V-JEPA. We believe this work is another important step in the journey towards AI that's able to understand the world, plan, reason, predict and accomplish complex
11:00 - 11:30 tasks. The alternative that we have now is a project called V-JEPA, and we are getting close to version two, where basically it's one of those joint embedding predictive architectures. So it does prediction on video, but at the representation level, and it seems to work really well. We have an example of this. The first version of this is trained on very short videos, just 16 frames, and it's trained to, uh, basically predict the representation of the full video from a partially masked
11:30 - 12:00 one. And that system apparently is able to tell you whether a particular video is physically possible or not, at least in restricted cases. And it gives you a binary output: this is feasible, this is not. Or maybe, well, no, it's simpler than this. You measure the prediction error that the system produces. So you take a sliding window of those 16 frames on a video and you look at, you know, can you predict the next few frames, and you measure the prediction error, and when something really strange happens in the video, like an object disappears or
12:00 - 12:30 changes shape, or, you know, something like that, or spontaneously appears, or doesn't obey physics, so it has learned what's physically realistic just by observing videos. Yeah, these are, you know, I mean, you train it on natural videos and then you test it on synthetic video where something really weird happens, right? So if you trained it on videos where really weird things happen, that would become normal and it wouldn't, uh, yeah that's right, detect those as being odd. So you don't do that. This is where Yann LeCun actually talks about the system one and system two
12:30 - 13:00 thinking. Of course, as humans we have two modes of thinking. System one is pretty much reactive, whereas system two is where we think about things for a longer time. And this is only a paradigm that LLMs have recently gotten to. And this is what Yann LeCun talks about when he says that AI systems are essentially missing some of those capabilities intuitively, and that's what we need in order to have a comprehensive system that can somehow get to AGI. And it connects with something that we're all very familiar with, right? So psychologists talk about
13:00 - 13:30 system one and system two. System one is tasks that you can accomplish without really sort of thinking about them. they they've become you become used to them and you you can accomplish them without thinking too much about them. So if you are an experienced driver you can drive even without driving assistance you can drive without thinking about it much you know you can talk to someone at the same time you can you know um etc. But if you are a a if you drive for the first time or the first few hours you are you are
13:30 - 14:00 at the wheel, you have to really focus on what you're doing, right? And you're planning all kinds of catastrophe scenarios and stuff like that, imagining all kinds of things. So that's system two. You're recruiting your entire prefrontal cortex, your world model, your internal world model, to, uh, figure out, you know, what's going to happen and then plan actions so that good things happen. Um, whereas when you're familiar with this, you can just use system one and sort of, uh, do this automatically. And so this
14:00 - 14:30 idea that you start by uh you know using your world model and you're able to accomplish a task even a task that you've never encountered before. Zero shot right? You don't have to be trained to solve that task. you can just so you can just accomplish that task without learning anything just on the basis of your understanding of of the world and your planning abilities. That's what's missing in current systems. But if you accomplish that task multiple times, then eventually it gets compiled into what's called a policy,
14:30 - 15:00 right? So a sort of reactive system that allows you to just accomplish that task without planning. So that deliberate reasoning is system two; the sort of automatic, subconscious, reactive policy, that's system one. Current systems can do system one and are trying to inch their way towards system two, but ultimately I think we need a different architecture for system two. So this is where we get to Yann LeCun's further statements, where he talks about the fact that, you know, we're simply just not
15:00 - 15:30 going to get to AGI via LLMs, and I do somewhat agree. I think that, you know, in the future the systems that really are general AI probably will be a hybrid of some sort. They'll be a mixture of all of those capabilities, and we've actually seen, you know, AI companies move towards omni models. We've seen Google actually doing that recently. So, it's really interesting to see him talk about this, because I don't think he's that far off the mark, and it's going to be really interesting to see where the future does go. Uh, but the real world is just much more complicated. Like, okay, here's something that some of you
15:30 - 16:00 may have heard me say in the past. Uh, current LLMs are typically trained with something on the order of 30 trillion tokens, right? A token typically is about three bytes, so that's 0.9 x 10^14 bytes. Let's say 10^14 bytes. Um, it would take any of us over 400,000 years to read through that, because that's kind of the totality of all the text available on the internet. Now, the psychologists tell us that a 4-year-old has been awake a total
16:00 - 16:30 of 16,000 hours. And we have about 2 megabytes going to our visual cortex through our optic nerve, um, every second, 2 megabytes per second roughly. Multiply this by 16,000 hours times 3,600 seconds, and it's about 10^14 bytes. In four years, through vision, you see as much data as text that would take you 400,000 years to read. I mean,
16:30 - 17:00 that tells you we're never going to get to AGI, whatever you mean by this, uh, by just training from text. It's just not happening.
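As a quick sanity check on the arithmetic in that closing argument, here is a short back-of-the-envelope calculation in Python. The token count, bytes per token, optic-nerve bandwidth, and waking hours come from LeCun's remarks above; the reading-speed figures used to reproduce the "hundreds of thousands of years" estimate are our own assumptions.

```python
# Back-of-the-envelope check of the text-vs-vision data comparison above.
# Figures for tokens, bytes/token, optic-nerve bandwidth, and waking hours come
# from the transcript; the reading-speed assumptions below are ours.

tokens = 30e12                 # ~30 trillion tokens of training text
bytes_per_token = 3
text_bytes = tokens * bytes_per_token
print(f"text corpus: {text_bytes:.1e} bytes")        # ~0.9 x 10^14, i.e. roughly 10^14

hours_awake = 16_000           # a 4-year-old's total waking hours (per the talk)
optic_nerve_bps = 2e6          # ~2 megabytes per second to the visual cortex
vision_bytes = hours_awake * 3600 * optic_nerve_bps
print(f"vision by age 4: {vision_bytes:.1e} bytes")   # ~1.2 x 10^14, the same order of magnitude

# Rough reading-time estimate (assumptions: ~0.75 words per token, ~250 words
# per minute, 8 hours of reading per day) to see where the "over 400,000 years"
# figure plausibly comes from.
words = tokens * 0.75
reading_minutes = words / 250
reading_years = reading_minutes / 60 / 8 / 365
print(f"reading time: ~{reading_years:,.0f} years")
```

Under these assumptions the script prints on the order of 10^14 bytes for both the text corpus and four years of vision, and a reading time in the hundreds of thousands of years, which is consistent with the numbers quoted in the talk.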