Why Education Needs an “AI-in-the-Loop” Model

Twelve-year-old Leo sat in his room, staring at his history assignment on the Industrial Revolution. Usually, such an assignment would take several hours: reading relevant material, picking a thesis, finding supporting evidence to back his claims, and then drafting the essay. But today, he simply fed a prompt into ChatGPT, made some quick revisions, and hit submit.

On paper, he looked like a star student. His essay even earned him an ‘A’ from his teacher. But did any real learning take place?

This is the crisis we face in education. We are currently obsessed with the “Human-in-the-Loop” model, where humans oversee AI outputs. But in a classroom, that model is backwards. If we want to raise a generation of innovators rather than mere prompt engineers, we have to flip the script to an “AI-in-the-Loop” approach that prioritizes the student’s cognitive effort before the algorithm ever enters the chat.

Renzulli’s Three-Ring Conception of Giftedness

To understand why current AI integration is risky, we have to look at what actually creates high-level human performance. One of the most respected frameworks in educational psychology is Joseph Renzulli’s Three-Ring Conception of Giftedness.

Renzulli argued that “giftedness” (or what we might call high-level creative productivity) lies at the intersection of three distinct clusters of human traits:

  1. Above-Average Ability: The core competency and foundational knowledge in a specific domain.
  2. Creativity: The ability to generate original ideas, see new patterns, and think divergently.
  3. Task Commitment: The grit, perseverance, and “productive struggle” required to see a difficult project through to the end.

When these three rings overlap, giftedness emerges. However, the current “plug-and-play” integration of AI in schools threatens to thin out every one of these rings, potentially preventing students from reaching high levels of accomplishment.

The Risk to Ability

The most immediate danger of AI is cognitive offloading—the tendency to use external tools to reduce the mental effort required for a task. While offloading is great for mundane chores (like GPS for driving), it is quite harmful for learning.

We already know that easy access to external information changes what people remember and how they allocate mental effort. The classic “Google effect” research showed that when people expect information to be accessible later, they are less likely to remember the information itself and more likely to remember where to find it. That isn’t necessarily bad. Humans have always used tools and transactive memory, but in schooling, ability is built through repeated retrieval, reconstruction, and sense-making.

Research has already begun to show the impact of cognitive offloading. In a large field experiment with high school math students, researchers found that a “ChatGPT-like” tutor improved performance during practice. But when the AI support was removed, those students performed worse than peers who had learned without the tool, an effect consistent with dependence and reduced durable learning.

Without guardrails, AI can act as a “cognitive crutch.” By providing hints and solving intermediate steps, it removes “desirable difficulty,” the kind of struggle that leads to deeper learning and long-term retention. If the AI does the heavy lifting, students learn less and never build the competence that high achievement requires.

The Risk to Creativity

Creativity is not just producing something. It’s producing something both novel and appropriate.

AI is remarkably good at the obvious. Large language models generate outputs that reflect patterns in their training data. That makes them useful, fluent, and fast, but it also means they can pull learners toward the statistical center of what’s been said before.

A recent experiment on creative writing found that access to AI ideas helped individuals produce stories judged as more creative (especially among less creative writers), but it also made the stories more similar to one another, reducing collective novelty. In other words, AI can raise the floor while lowering the ceiling of diversity.

Separate work has begun to quantify this “echo” effect more directly, showing measurable limits on plot diversity in LLM outputs under the same prompts. And broader research reviewing LLM creativity suggests that while models can appear creative through recombination, there are persistent questions about originality, intent, and the difference between pattern completion and human creative agency. 

When students brainstorm with AI first, they often anchor on the suggestions they see and fall into an “associative rut.” If they took the time to think by themselves before reaching out to AI, they would discover more original and personally meaningful ideas. 

And we saw a similar phenomenon long before AI. Research on brainstorming has shown that nominal brainstorming (individual idea generation before group discussion) produces more ideas, and more original ideas, than purely interactive brainstorming. In the AI context, the LLM acts like a “dominant personality” in a group meeting. It speaks first, speaks confidently, and sets a baseline. Once students see those AI suggestions, their brains find it incredibly difficult to think outside those parameters.

The Risk to Commitment

Task commitment includes persistence, delayed gratification, self-regulation, and the willingness to stay with ambiguity.

Modern technology has been eroding this ring for years. Research ties higher use of digital devices to concentration difficulties, lower academic performance, and poorer self-regulation.

Now layer generative AI on top. In a world where a chatbot can produce instant essays and workable code, the emotional “cost” of effort feels high. Why wrestle with a challenge when the answer is right at your fingertips?

Emerging education research is also starting to map how generative AI intersects with self-regulated learning—highlighting both risks (over-reliance, reduced monitoring) and opportunities (scaffolds for planning, reflection, and feedback) depending on design and pedagogy. And survey-based findings have reported associations between ChatGPT use and procrastination or lower performance in some student samples, suggesting that without strong norms and supports, AI can drift from scaffold to shortcut. 

Then there’s an additional twist: changing expectations. Once AI is available, teachers and workplaces may (implicitly or explicitly) expect faster output. But speed is not the same as depth. Many breakthroughs require a long dwell time. If we compress the timeline before students have built the inner muscles of persistence, we don’t get high performers.

The Way Out

In many AI discussions, “human in the loop” sounds reassuring: the human checks the AI’s work. But in education, that framing can be backward. It puts students in the role of evaluator rather than constructor, as if learning were mainly about spotting mistakes in someone else’s thinking.

Decades of learning science tell us that durable learning is constructive and interactive. The ICAP framework, for example, shows that learning activities that are Interactive and Constructive generally outperform merely Active or Passive engagement. Students learn more when they generate, explain, debate, and build meaning, rather than just consume or lightly manipulate information.

In education, we need a “Learning-First” model that prioritizes human cognition before algorithmic assistance. A sample 5-stage framework could look like this:

Phase 1 (Individual): students write an initial thesis, solution path, or set of ideas before using AI. This protects original cognition and forces retrieval, sense-making, and ownership.

Phase 2 (Group): students critique and build together. This is where misconceptions surface and learning becomes social, with students learning from each other.

Phase 3 (AI): only then does AI enter as a gap-finder, alternative-perspective generator, or Socratic questioner. It reveals elements that students might have missed and stretches their thinking (a minimal code sketch of this gating idea follows Phase 5).

Phase 4 (Group): the group revises their solution based on the feedback from AI, synthesizing aspects that are reasonable and rejecting those that don’t fit well.

Phase 5 (Individual): individuals reconstruct the argument/solution in their own words because self-explanation is a reliable way to accelerate understanding.
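
Here is that sketch: one way a classroom tool might gate AI access until Phase 3 and constrain it to a Socratic, gap-finding role. The ask_llm callable and the prompt wording are illustrative assumptions, not any specific product’s API:

```python
# Minimal sketch: AI is gated until Phase 3, after the student has
# committed an individual draft. `ask_llm` is a hypothetical stand-in
# for whatever chat-completion client a classroom tool would use.

SOCRATIC_ROLE = (
    "You are a Socratic tutor. Do not write or rewrite the student's work. "
    "Ask probing questions, point out gaps in evidence or logic, and offer "
    "one alternative perspective the student has not considered."
)

def phase3_feedback(student_draft: str, ask_llm) -> str:
    """Return gap-finding questions, never a finished answer."""
    if not student_draft.strip():
        # Protect Phase 1: no AI help before the student produces a draft.
        raise ValueError("Phase 1 incomplete: a student draft is required first.")
    prompt = (
        f"{SOCRATIC_ROLE}\n\n"
        f"Student draft:\n{student_draft}\n\n"
        "Respond only with questions and observations about missing angles."
    )
    return ask_llm(prompt)
```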

This scaffolded approach protects each ring of Renzulli’s model. Ability is built through retrieval and reconstruction. Creativity is protected through first-thought originality and peer divergence. Task commitment is strengthened through social support and reflection.

Conclusion

Education has a choice to make. Without the right guardrails on how to use AI, we risk teaching the models instead of the students.

The danger of a default “human-in-the-loop” stance in classrooms is that it casts students as editors of machine work. But learning isn’t editorial. Students must build mental models, connect ideas, and develop the internal fluency that only comes from doing the cognitive work themselves.

So the guiding question for AI in education should be, “Which phase of learning does this tool strengthen and which phase might it accidentally replace?”

If we stay student-first and learning-first, we’ll use AI the way every great teacher uses support: not to remove the mountain, but to help students become the kind of climbers who can scale it, long after the tool is gone.

The “Tiny Team” Organization Is Here and It’s Redrawing the Management Map

In the past year or two, the business world has felt more like a bumpy ride than a smooth “transformation.” Employees are dealing with a lot of uncertainty—roles changing or getting eliminated entirely, teams shuffling, and rules shifting mid-game. But leaders aren’t operating from a crystal-clear blueprint either. Many are making big cuts, not just because AI speeds things up, but because they honestly can’t see what the company should look like long-term. So, they reduce costs and complexity first, then plan to rebuild smarter.

The tricky part is that AI isn’t a neat replacement for people or their jobs. It absolutely makes many tasks faster, but it also creates entirely new work. Think about customer support: many companies use chatbots to handle volume, but now someone has to watch performance, check logs to find problems, fine-tune the prompts and rules, and constantly improve the system. The work doesn’t disappear; it just moves and changes form.

Still, one advantage is unmistakable. For decades, the biggest hidden cost in any company was the Coordination Tax. You know the drill: an engineer builds a feature, then hands it to a product manager, who syncs with marketing, who waits for a business dev lead to find a partner. Every handoff is a friction point and every meeting is a tax on productivity. But now a single “pod” of 5 to 10 people—comprising a mix of engineering, design, and growth talent—can execute faster than a 50-person department.

2026 looks like the year that some of these trends become normal: smaller, multidisciplinary, autonomous feature teams, and flatter organizations where the middle-management role evolves from “traffic cop” to “coach + systems designer.”

Trend 1: Smaller, multidisciplinary “feature teams” become the default

As AI lifts individual capability, it becomes feasible to staff product work like a small startup.

Instead of a large, functionally segmented machine, you get teams of 5–10 people who can handle a feature from the initial idea all the way through building, shipping, and improving it. Speed shoots up because there is less coordination and fewer approvals needed. Quality rises because feedback loops tighten.

We can already see executives publicly describing this “tiny team” dynamic.

  • Mark Zuckerberg recently noted that AI lets “a single very talented person” tackle projects that used to need “big teams,” and he’s actively pushing to “flatten teams.”
  • Tobias Lütke, Shopify’s CEO, gave his teams a powerful signal: before asking for more people, you have to prove why you “cannot get what you want done using AI.” This directly asks them to think of AI as a part of their team that handles work autonomously.
  • Duolingo’s CEO, Luis von Ahn, sees AI as a “platform shift” and is adjusting things like hiring and performance. They’re also reducing contractor work where AI can step in—another clear move toward “smaller teams that deliver much more.”

The common thread in these examples isn’t just “AI is useful.” It’s that they are rethinking the basic rules of how the company operates. The organization shifts from shuffling work between departments to empowering small, focused teams to fully own their results.

Ever watched a two-person team crank out something amazing over a weekend and thought, “How did they move so fast?” Now, imagine that kind of efficiency happening across dozens of teams, each one fully supported by AI.

That’s the emerging design pattern.

Trend 2: Flatter orgs and middle management under pressure

Once small, autonomous teams are successful, a second-order effect kicks in: you just don’t need as many layers of management to keep them in sync. This is the point where the discussion about “flattening” an organization becomes very real, and very uncomfortable.

Experts studying the workforce have been tracking “delayering” everywhere, not just in tech. Korn Ferry, for instance, talks about companies “thinning out their management midsections,” pointing out that middle managers were a big part of 2024 layoffs, clear evidence that the traditional manager role is under structural pressure.

And major corporations are openly saying their restructures are about cutting bureaucracy. Amazon’s corporate layoffs, for instance, were reported as an effort to reduce organizational layers and operate more efficiently.

So, the pattern we should expect to continue seeing in 2026 isn’t a world with “no managers.” It’s a world with fewer managers whose fundamental job is completely different.

Which brings up an important, yet simple, question: If AI reduces coordination work, what exactly should managers coordinate?

The manager’s role is evolving

In most modern orgs, managers have been doing (at least) three different things:

  1. People development: coaching, feedback, growth, hiring, conflict navigation, culture
  2. Technical/project leadership: running execution, reviewing work, unblocking tasks, prioritizing
  3. Systems and strategy: setting guardrails, aligning across teams, shaping operating systems, long-term planning

AI and autonomous feature teams change the distribution of these responsibilities.

1) People development becomes more important, not less

As teams become more independent and things change faster, people really need grounding: clear direction, a safe space to work, ways to grow, and honest feedback. AI can help write a review, but it can’t handle the human side of building trust, shaping identity, and finding meaning in our work.

2) Day-to-day technical leadership moves closer to the team

Here’s where the scope for many middle managers gets a little smaller. When you have a focused feature team, the day-to-day execution leadership often happens right within the group—think a senior engineer, the product and design leads, and a shared AI process. The manager’s job shifts away from being the daily air traffic controller.

3) Systems-level strategy becomes the manager’s differentiator

As pods proliferate, someone must design the system those pods operate within: how they operate, what the quality rules are, how risk is controlled, what the portfolio priorities are, and how the teams talk to each other.

This is exactly the direction highlighted by McKinsey & Company in its writing on “agentic” organizations: we’ll see more M-shaped supervisors (broad generalists orchestrating agents and hybrid work across domains) alongside T-shaped experts (deep specialists safeguarding quality and exceptions).

As agents take on more execution, managers are freed up from admin tasks. Their focus is shifting toward leading people and orchestrating these blended systems. In other words, the ideal talent profile is changing. It’s less about being “the smartest technical person in the room” and much more about emotional intelligence and the ability to think strategically and connect the dots.

The emerging operating model

So what does this look like in practice? Expect more organizations to formalize patterns like:

Autonomous Feature Pods: Small teams of about five to ten people, each with a crystal-clear mission, like “improve customer sign-ups.” These teams are accountable for everything, from building to shipping to measuring success. They have all the necessary skills embedded right in the pod—product, design, engineering, and data. What makes them so fast? They use AI agents as a built-in assistant for everything from research and drafting to testing and analysis.

Thinner Management Layers: This shift also means a leaner leadership structure. You’ll see fewer layers of management. Instead, managers will take on a wider scope, focusing less on directing tasks and more on coaching and ensuring the system is running smoothly. We’ll also see more senior, non-managerial roles, like staff or principal engineers, who provide technical leadership without adding more hierarchy.

Guardrails over Gates: Finally, the way work is governed is changing from “approval-heavy” to “principle-driven.” Instead of waiting for sign-offs on every step, teams operate within clear “guardrails”—security protocols, data policies, quality standards, and ethical rules. This allows teams to move much faster and ship products without constant delays.

The trend for the rest of 2026 is clear: Organizations will continue to shrink in headcount but explode in impact. We are moving away from the “industrial” model of management, where people were cogs in a machine, toward a “biological” model, where small, autonomous cells work together to create a living, breathing, and highly adaptive organism.

Beyond the Automation Trap: Why AI Needs Values

In 1997, after Garry Kasparov lost his historic chess match to IBM’s Deep Blue, he didn’t just walk away or rail against the machine. Instead, he started a new kind of competition called “Advanced Chess.” In these matches, a human player and a computer worked together as a team—a “Centaur.”

What happened next was quite unexpected. Amateur players with mid-range computers often beat both grandmasters and higher-end chess computers. They knew when to listen to the machine and when to override it. They used the computer to explore possibilities, but they used their human judgment to make the final call.

In other words, the most powerful force wasn’t the smartest machine but the best collaboration.

Today, we are at a similar crossroads with Artificial Intelligence. We’ve built the machines, but we haven’t quite figured out how to be Centaurs. And that might be why AI adoption is stalling.

The Diffusion Mystery

If you look at the headlines, AI is taking over the world. But if you look at the data, the picture is more complicated.

Everett Rogers, the legendary sociologist who gave us the “Diffusion of Innovations” theory, taught us that technology doesn’t spread simply because it’s better. It spreads because it fits into our lives, our norms, and our trust networks. Right now, AI has a fit problem.

According to McKinsey’s 2025 global research, while almost every company is playing with AI, very few have successfully scaled it. The problem might be the kinds of problems we are trying to solve with AI. It’s not that the technology is too complex; it’s that we’re trying to use a “tame” solution for a “wicked” world.

Tame Tasks vs. Wicked Problems

In the 1970s, design theorists Horst Rittel and Melvin Webber identified two types of challenges:

  1. Tame Problems: These have a clear goal and a clear stopping rule. Think of a puzzle or a math equation. Coding is often a tame problem. You write the script, you run the test, and it either works or it doesn’t. This is why AI adoption has worked quite well for developers.
  2. Wicked Problems: These are messy. They have no clear definition and no right answer, only “better” or “worse” ones. Moreover, every time you try to solve a wicked problem, the problem changes. Think of education, healthcare, or leading a team.

When we try to use AI to solve a wicked problem through pure automation, we fail because wicked problems require judgment, and good judgment requires something else.

Turbulent Fields

Systems theorist Eric Trist called the environment we live in today a “turbulent field.” Imagine trying to play a game of soccer, but the grass is moving, the goals are shifting, and the other team keeps changing the rules. That’s turbulence. And turbulence creates wicked problems. 

In a stable world, you can rely on data and optimization. But in a turbulent world, more data often leads to more confusion. Instead of data, you need a North Star that reduces the number of variables you have to optimize for. Trist argued that values are effective North Stars for solving such complex problems. They clarify direction by eliminating options that don’t fit within those values.

This might be one reason why solving problems with AI is so challenging. Without clearly defined values, AI becomes a black box that’s hard to trust.

Designing with Values

If we want AI to actually work for us, we have to stop designing for automation and start designing for human flourishing.

This brings us to one of the most important frameworks in social science, and one I have seen work remarkably well in practice: Self-Determination Theory (SDT). For people to be at their best, they need three things:

  • Autonomy: The desire to be the author of our work and lives.
  • Mastery (or Competence): The urge to learn new things and get better at skills that matter.
  • Purpose (or Relatedness): The yearning to do what we do in the service of something larger than ourselves.

The “Automation Trap” kills all three. If an AI writes your entire report, you lose your autonomy (you’re just a spectator). You lose your mastery (your skills begin to atrophy). And eventually, you lose your sense of purpose.

This is the “Irony of Automation.” As researcher Lisanne Bainbridge pointed out, the more we automate, the more we rely on humans to handle the rare, high-stakes crises. But if the human has been sidelined by the automation, they no longer have the skills to save the day when the machine fails.

Nowhere is this tension clearer than in the classroom. If a student uses AI to generate an essay, the task is finished, but the learning never happened.

Learning requires productive struggle. Elizabeth and Robert Bjork’s research on “desirable difficulties” shows that we learn best when the process feels a little bit hard. When we remove the struggle, we remove the growth.

If we want AI to diffuse in education, and for that matter, in any knowledge-work field, we have to move from “Answer Engines” to “Thought Partners.”

A New Blueprint for the AI Collaborator

So, what does a value-driven, human-centered AI look like? It follows a different set of design principles:

1. Values Over Vibes

Wicked problems are resolved by making choices based on what we value most. An AI collaborator shouldn’t hide these choices. It should surface them. Instead of saying “Here is the best strategy,” it should say “If you value speed, do X; if you value employee well-being, do Y.”

2. Design for Mastery

Success shouldn’t just be measured by task completion. It should be measured by capability gained. Does this AI help the user understand the problem better? Does it challenge their assumptions? A great AI should function like a coach, nudging the user to do their best thinking rather than doing the thinking for them.

3. Human Stewardship

In a turbulent field, the “correct” answer is often a conversation. AI can widen our options and test our scenarios, but humans must steward the meaning. We are the ones who decide which values matter and which trade-offs are worth making.

The Question for 2026

As we stare down 2026, we need to stop asking, “What can AI do?” and start asking, “What values should do the steering?”

For the last two years, we’ve been obsessed with technical possibilities. We’ve treated AI like a new engine and spent all our time seeing how fast it can go. But in a turbulent field, speed without a North Star is just a faster way to get lost. If we continue to design simply because a solution is possible, we will keep falling into the Automation Trap.

The truth is, technological possibility should never precede moral clarity. In the era of wicked problems, the right answer doesn’t exist in the data; it exists in our intentions. If we want to move from “Answer Engines” to true Centaur-style collaboration, we have to identify the values we are designing for before we write a single line of code.

The real lesson of Garry Kasparov’s Centaurs wasn’t that they had better computers. It was that they had a better process rooted in human judgment. In the long run, the real competitive advantage won’t be the machine’s speed. It will be our wisdom.

Designing Human-AI Workflows for Synergy

A sobering meta-analysis reveals a counterintuitive truth: most human-AI collaborations actually underperform compared to either the human or the AI working alone. Consider a study on fake hotel review detection: the AI achieved 73% accuracy, the human 55%, yet the combined system managed only 69%.

This raises a crucial question: How do we architect human-AI collaborations that truly elevate performance?

If you’re leading an AI rollout, that question is more than academic. It determines whether your investment produces step-change performance or an expensive stalemate. Simply placing people and AI in the same workflow does not guarantee better results. What matters is what they do together, and how intentionally you design the collaboration.

Synergy vs. Augmentation

The researchers in the study above investigated two desirable outcomes: synergy and augmentation. Synergy represents the ideal state, where the combined human-AI performance surpasses both the human alone and the AI alone, mirroring “strong synergy” found in purely human groups. Augmentation is a more modest goal and simply means the human-AI system performs better than the human alone.

A common implicit assumption is that the combined system must be better than either component, but the reality is often complicated by human behavioral pitfalls. 

Humans frequently struggle to find the right balance of trust: they either over-rely on AI and blindly accept its suggestions, or under-rely and prematurely dismiss valuable AI input. In the fake hotel review study, for example, because humans were worse at the task than the AI, they made poor judges of the AI’s recommendations, leading to a sub-par outcome. So while AI augmented human accuracy (55% → 69%), the combined system was still less effective than the AI alone (73%).

On the other hand, in a study on bird image classification, the AI was only 73% accurate, compared to expert human performance of 81%. Yet the human-AI collaboration reached 90% accuracy, better than either alone. This is human-AI synergy: expert humans could better decide when to trust their own judgment versus the algorithm’s, improving the overall system performance.
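
To make these definitions concrete, here is a small Python sketch that classifies an outcome from the component and combined accuracies, using the numbers reported in the two studies above:

```python
def classify_outcome(human: float, ai: float, combined: float) -> str:
    """Label a human-AI result relative to its components.

    Synergy: the combined system beats both the human alone and the AI alone.
    Augmentation: it beats the human alone, but not the best single component.
    """
    if combined > max(human, ai):
        return "synergy"
    if combined > human:
        return "augmentation"
    return "underperformance"

# Fake hotel review detection: human 55%, AI 73%, combined 69%
print(classify_outcome(0.55, 0.73, 0.69))  # -> augmentation
# Bird image classification: human 81%, AI 73%, combined 90%
print(classify_outcome(0.81, 0.73, 0.90))  # -> synergy
```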

Complementarity

Another view of synergy in human-AI collaboration comes from research on complementarity, a practical way to ensure that what the human brings and what the AI brings are meaningfully different and mutually enhancing.

According to the authors, it is useful to think of the distinct ways humans and AI approach decision-making, which result from two key asymmetries:

  1. Information Asymmetry: Often, AI and humans operate with different inputs. AI relies on a vast collection of digitized data. Humans, however, draw on a broader, richer context that includes non-digitized real-world knowledge. For example, an AI might accurately diagnose from a scan, but a human doctor also factors in the patient’s demeanor, additional symptoms, or prior history. This holistic view gives the human a distinct informational advantage in complex situations.
  2. Capability Asymmetry: Even given the exact same information, the processing methods differ. AI models infer patterns from vast datasets while humans use more flexible mental models to build an intuitive understanding of the world. This allows humans to learn rapidly, often after only a few trials, and accumulate lifelong experience. On the other hand, AI can instantly digest massive amounts of information and detect tiny, subtle variations in data that would be imperceptible to a human. These asymmetries give each side unique, complementary capabilities.

Where teams stumble is when these asymmetries are flattened. If your process gives humans and models the same inputs and asks them to do the same step, one of them is redundant. If, instead, you assign different roles and design a clean way for their contributions to combine, the whole becomes greater than the sum of its parts.

When, and When Not, to Use AI

Rethinking the architecture of modern work to integrate human and artificial intelligence demands a careful, nuanced approach. The success of this collaboration hinges on a thoughtful consideration of two critical factors: the inherent nature of the task and the complementary strengths of the human and AI partners.

Task Type

The type of work at hand fundamentally dictates the potential for synergy. For example, the MIT meta-analysis found that Innovative Tasks are the “sweet spot” for maximum impact. These are characterized by open-ended goals and constraints that evolve in real time through iteration and exploration. Here, humans have a natural ability to think in non-linear ways, associating unrelated concepts to forge novel and meaningful connections. For such tasks, AI supplies the vast informational landscape from which those connections can be drawn, leading to synergy.

However, for Decision Tasks that primarily require evaluation or judgment, the story is not as straightforward. In some cases, the human-AI collaboration can perform worse than either the human or the AI alone. It depends on what the task is and how it is split between the AI and the human, as discussed next.

Task Separation

Underlying all successful partnerships is the ability to leverage distinct, complementary strengths. Human-AI collaboration is no different.

It sounds counterintuitive, but if the AI’s initial performance is overwhelmingly superior, the overall human-AI system may actually perform worse. The thoughtful approach, therefore, is to precisely restrict the AI’s role to only those sub-tasks where it has a clear advantage.

Conversely, when the human is the stronger initial decision-maker, the partnership tends toward greater success. As the expert, the human is better positioned to critically assess the AI’s input and selectively integrate it into the process, in a synergistic fashion.
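
One way to operationalize this split is a simple routing rule: give each sub-task to whichever party has the demonstrated edge. A toy sketch, with hypothetical per-sub-task accuracy estimates standing in for real evaluation data:

```python
# Toy routing sketch: restrict the AI to sub-tasks where it has a clear,
# measured advantage. The sub-tasks and accuracies are hypothetical.
SUBTASK_ACCURACY = {
    # sub-task: (human_accuracy, ai_accuracy)
    "detect_subtle_patterns": (0.70, 0.92),
    "weigh_real_world_context": (0.85, 0.64),
}

def route(subtask: str, margin: float = 0.05) -> str:
    """Assign a sub-task to the AI only when its edge exceeds a margin."""
    human, ai = SUBTASK_ACCURACY[subtask]
    return "ai" if ai >= human + margin else "human"

for task in SUBTASK_ACCURACY:
    print(task, "->", route(task))
```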

Ultimately, effective human-AI collaboration is not about replacing one with the other; it is a delicate exercise in defining boundaries, recognizing unique excellence, and ensuring that the final output is greater than the sum of its different parts.

Conclusion

Simply introducing AI into a workflow is not a prescription for performance improvement and can, in fact, lead to an expensive underperformance. Achieving true synergy—where the human-AI system surpasses both components working alone—requires intentional design built on complementarity. 

A crucial lesson is that human expertise matters. Experts are better positioned to critically assess and leverage AI input, transforming simple augmentation into synergistic gain. This is particularly relevant in the ‘sweet spot’ of Innovative Tasks, where human creative thinking and an AI’s vast informational landscape can combine to produce surprising creative breakthroughs.

Ultimately, the future of work hinges on recognizing and embracing these boundaries, and on accepting that AI may not be suitable for every kind of task. Instead of replacing humans, the goal is to define complementary roles that exploit the inherent asymmetries in information and capability. By expertly differentiating tasks, organizations can move past the common trap of underperformance toward a future of genuine human-AI synergy.

Boosting AI’s Intelligence with Metacognitive Primitives

Over the past year or so, AI experts, like Ilya Sutskever in his NeurIPS 2024 talk, have been raising concerns that AI reasoning might be hitting a wall. Simply throwing more data and computing power at the problem seems to be giving us less and less in return, and models still struggle with complex thinking tasks. Maybe it’s time to explore other facets of human reasoning and intelligence, rather than relying on sheer computational force.

At its core, a key part of human intelligence is our ability to pick out just the right information from our memories to help us solve the problem at hand. For instance, imagine a toddler seeing a puppy in a park. If they’ve never encountered a puppy before, they might feel a bit scared or unsure. But if they’ve seen their friend playing with their puppy, or watched their neighbors’ dogs, they can draw on those experiences and decide to go ahead and pet the new puppy. As we get older, we start doing this for much more intricate situations – we take ideas from one area and apply them to another when the patterns fit. In essence, we have a vast collection of knowledge (made up of information and experiences), and to solve a problem, we first need to identify the useful subset of that knowledge.

Think of current large language models (LLMs) as having absorbed the entire knowledge base of human-created artifacts – text, images, code, and even elements of audio and video through transcripts. Because they’re essentially predictive engines trained to forecast the next word or “token,” they exhibit a basic level of reasoning that emerges from the statistical structures within the data rather than from deliberate thought. What has been truly remarkable is how much basic reasoning this extensive “knowledge layer” can exhibit through statistical prediction alone.

Beyond this statistical stage of reasoning, prompting techniques such as assigning a specific role to the LLM improve reasoning further. Intuitively, they work because they help the LLM focus on the more relevant parts of its network or data, which in turn enhances the quality of the information it uses. More advanced strategies, such as Chain-of-Thought or Tree-of-Thoughts prompting, mirror human reasoning by guiding the LLM through a more structured, multi-step traversal of its knowledge bank. One way to think about these strategies is as higher-level approaches that dictate how to proceed. A fitting name for this level might be the Executive Strategy Layer – this is where the planning, exploration, self-checking, and control policies reside, much like the executive network in human brains.
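
As a quick illustration of these two techniques, here is a minimal Python sketch that layers a role and a Chain-of-Thought instruction onto a single query. The ask_llm call is a hypothetical stand-in for whatever LLM client you use:

```python
# Minimal sketch of the two prompting layers described above.
# `ask_llm(prompt)` is a hypothetical stand-in for any chat-completion client.

def with_role(question: str, role: str) -> str:
    # Role prompting: steer the model toward the relevant region of its knowledge.
    return f"You are {role}.\n\n{question}"

def with_chain_of_thought(prompt: str) -> str:
    # Chain-of-Thought: ask for explicit intermediate steps before the answer.
    return prompt + "\n\nThink step by step, then state your final answer."

question = "A train travels 120 km in 90 minutes. What is its average speed in km/h?"
prompt = with_chain_of_thought(with_role(question, "a careful math tutor"))
# answer = ask_llm(prompt)  # expected reasoning: 120 km / 1.5 h = 80 km/h
```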

However, it seems current research might be missing another layer: a middle layer of metacognitive primitives. Think of these as simple, reusable patterns of thought that can be called upon and combined to boost reasoning, no matter the topic. You could imagine it this way: while the executive strategy layer helps an AI break down a task into smaller steps, the metacognitive primitive layer makes sure each of those mini-steps is solved in the smartest way possible. This layer might involve asking the AI to find similarities or differences between two ideas, move between different levels of abstraction, connect distant concepts, or even look for counter-examples. These strategies go beyond just statistical prediction and offer new ways of thinking that act as building blocks for more complex reasoning. It’s quite likely that building this layer of thinking will significantly improve what the Executive Strategy Layer can achieve.
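
One way to picture this middle layer is as a small library of reusable “thinking moves” that the executive layer selects and composes for each sub-step. A speculative sketch; the primitive names and wording here are illustrative, not an established taxonomy:

```python
# Speculative sketch: metacognitive primitives as reusable prompt fragments
# that an executive layer composes for each sub-step. Names are illustrative.
PRIMITIVES = {
    "compare_contrast": (
        "Compare and contrast the candidate ideas on the dimensions "
        "that matter most, noting surprising similarities or differences."
    ),
    "shift_abstraction": (
        "Restate the problem one level more abstract, reason there, "
        "then map the insight back down to the concrete case."
    ),
    "counter_example": "Search for a counter-example before accepting the conclusion.",
    "associate": "Connect the problem to a distant concept and mine the link for insight.",
}

def apply_primitives(substep: str, names: list[str]) -> str:
    """Compose the selected thinking moves into the prompt for one sub-step."""
    moves = "\n".join(f"- {PRIMITIVES[n]}" for n in names)
    return f"Sub-step: {substep}\n\nApply these thinking moves:\n{moves}"

print(apply_primitives(
    "Evaluate two designs for a water-saving showerhead",
    ["compare_contrast", "counter_example"],
))
```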

To understand what these core metacognitive ideas might look like, it’s helpful to consider how we teach human intelligence. In schools, we don’t just teach facts; we also help students develop ways of thinking that they can use across many different subjects. For instance, Bloom’s revised taxonomy outlines levels of thinking, from simply remembering and understanding, all the way up to analyzing, evaluating, and creating. Similarly, Sternberg’s theory of successful intelligence combines analytical, creative, and practical abilities. Within each of these categories, there are simpler thought patterns. For example, smaller cognitive actions like “compare and contrast,” “change the level of abstraction,” or “find an analogy” play an important role in analytical and creative thinking.

The exact position of these thought patterns in a taxonomy is less important than making sure learners acquire these modes of thinking and can combine them in adaptable ways.

As an example, one primitive that is central to creative thinking is associative thinking — connecting two distant or unrelated concepts. In a study last year, we showed that by simply asking an LLM to incorporate a random concept, we could measurably increase the originality of its outputs across tasks like product design, storytelling, and marketing. In other words, by turning on a single primitive, we can actually change the kinds of ideas the model explores and make it more creative. We can make a similar argument for compare–contrast as a primitive that works across different subjects: by looking at important aspects and finding “surprising similarities or differences,” we might get better, more reasoned responses. As we standardize these kinds of primitives, we can combine them within higher-order strategies to achieve reasoning that is both more reliable and easier to understand.
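
In code, that manipulation is strikingly simple. Here is a sketch of the random-concept injection; the concept pool and prompt wording are illustrative rather than the study’s exact protocol:

```python
import random

# Illustrative concept pool; the study's actual stimuli may differ.
CONCEPTS = ["lighthouse", "origami", "coral reef", "metronome", "compost"]

def associative_prompt(task: str) -> str:
    """Turn on the associative-thinking primitive by forcing in a distant concept."""
    concept = random.choice(CONCEPTS)
    return f"{task}\n\nIncorporate the concept of '{concept}' in a meaningful way."

print(associative_prompt("Propose a novel product design for urban commuters."))
```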

In summary, giving today’s AI systems a metacognitive-primitives layer—positioned between the knowledge base and the Executive Strategy Layer—might provide a practical way to achieve stronger reasoning. The knowledge layer provides the content; the primitives layer supplies the cognitive moves; and the executive layer plans, sequences, and monitors those moves. This three-part structure mirrors how human expertise develops: it’s not just about knowing more, or only planning better, but about having the right units of thought to analyze, evaluate, and create across various situations. If we give LLMs explicit access to these units, we can expect improvements in their ability to generalize, self-correct, be creative, and be more transparent, moving them from simply predicting text toward truly adaptive intelligence.