AI Insiders

Author: Ronald Soh

Description

"AI Insiders" positions itself as the go-to podcast for deep, behind-the-scenes insights into the AI industry, offering listeners an insider's perspective on the technology, business, and future of artificial intelligence.

TARGET AUDIENCE:

- Tech professionals and developers
- Business leaders and decision-makers
- AI enthusiasts and students
- Industry stakeholders
- Tech-savvy general audience interested in AI's impact

UNIQUE SELLING POINTS:

- Expert Access: Featuring interviews with AI researchers, tech leaders, and industry pioneers
- Behind-the-Scenes: Exclusive insights into AI companies and breakthrough developments
- Real-world Applications: Practical discussions about AI implementation across industries
- Future Trends: Analysis of emerging technologies and market directions

POTENTIAL SEGMENTS:

"The Inside Scoop" - Breaking news and industry updates
"Tech Deep Dive" - Technical explanations of new AI developments
"Startup Spotlight" - Featuring promising AI startups
"AI Ethics Corner" - Discussing ethical implications and challenges
"Use Case Breakdown" - Real-world applications and success stories

TONE & STYLE:

Professional yet accessible
Technical but explained clearly
Engaging and conversational
Thought-provoking and insightful

EPISODE FORMATS:

1) Interview Episodes (45-60 mins)
- One-on-one conversations with industry experts
- Panel discussions on trending topics

2) News & Analysis Episodes (30-45 mins)
- Weekly roundup of major AI developments
- Expert commentary on industry trends

3) Special Features (60+ mins)
- Live event coverage
- In-depth technology reviews
- Industry conference highlights

POTENTIAL EPISODE THEMES:

"The Future of Large Language Models"
"AI in Healthcare: Revolution or Evolution?"
"Building Ethical AI Systems"
"Startup to Scale: AI Success Stories"
"The Human Side of AI Development"

MARKETING TAGLINES:

"Where AI Meets Insight"
"Your Backstage Pass to the AI Revolution"
"AI Understanding, Unlocked"
"The Stories Behind the Code"
24 Episodes

This paper addresses the challenges associated with adapting Large Language Models (LLMs) for various tasks within the e-commerce domain using prompting techniques. While prompting offers an efficient alternative to fine-tuning, it often requires significant manual effort from domain experts for prompt engineering and frequent updates to align with evolving business needs. Furthermore, crafting truly unbiased natural language prompts and selecting representative in-context examples remain difficult for humans. The authors propose a novel framework called Examples as the Prompt (EaP). This approach leverages labelled data to enhance prompts by automatically selecting the most representative examples to maximise the few-shot learning capabilities of LLMs. EaP is designed to be efficient due to its unsupervised example selection and adaptive to potential data distribution shifts.
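To make the idea concrete, here is a minimal sketch of example-based prompting under one plausible reading of the approach: the labelled examples most similar to the incoming query are placed into the prompt as few-shot demonstrations. The embed() function and the small product-category pool are illustrative placeholders, not details taken from the paper.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical embedding function; a real system would call a sentence-embedding model.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.normal(size=64)

    def select_examples(query: str, pool: list[dict], k: int = 2) -> list[dict]:
        # Rank labelled examples by cosine similarity to the query and keep the top k.
        q = embed(query)
        def sim(ex):
            v = embed(ex["input"])
            return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        return sorted(pool, key=sim, reverse=True)[:k]

    def build_prompt(query: str, pool: list[dict]) -> str:
        shots = select_examples(query, pool)
        demos = "\n\n".join(f"Input: {ex['input']}\nLabel: {ex['label']}" for ex in shots)
        return f"{demos}\n\nInput: {query}\nLabel:"

    pool = [
        {"input": "Wireless noise-cancelling headphones", "label": "Electronics"},
        {"input": "Organic cotton baby onesie", "label": "Baby & Kids"},
        {"input": "Stainless steel chef's knife", "label": "Kitchen & Dining"},
    ]
    print(build_prompt("Bluetooth over-ear headset", pool))
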
This paper addresses a core challenge in aligning large language models (LLMs) with human preferences: the substantial data requirements and technical complexity of current state-of-the-art methods, particularly Reinforcement Learning from Human Feedback (RLHF). The authors propose a novel approach based on inverse reinforcement learning (IRL) that can learn alignment directly from demonstration data, eliminating the need for explicit human preference data required by traditional RLHF methods. This research presents a significant step towards simplifying the alignment of large language models by demonstrating that high-quality demonstration data can be effectively leveraged to learn alignment without the need for explicit and costly human preference annotations. The proposed IRL framework offers a promising alternative or complementary approach to existing RLHF methods, potentially reducing the data burden and technical complexities associated with preference collection and reward modelling.
This paper investigates the impact of Generative Artificial Intelligence (GAI), such as ChatGPT, Kimi, and Doubao, on students' learning across four grade levels (high school sophomores and juniors, university juniors and seniors) in six key areas collectively termed LIPSAL: learning interest, independent learning, problem-solving, self-confidence, appropriate use, and learning enjoyment. The study employed a hybrid-survey method combining questionnaires and group interviews. Key findings indicate that GAI has a generally positive impact on all LIPSAL aspects, with the most significant influence on 'appropriate use' and 'independent learning', and the least on 'learning interest' and 'self-confidence'. University students reported a higher level across all LIPSAL aspects compared to high school students. Students hold a positive attitude towards GAI and are willing to use it, recognising its potential while also acknowledging challenges related to accuracy, over-dependence, and ethical considerations.
This paper critically examines the use of multiple-choice question (MCQ) benchmarks to assess the medical knowledge and reasoning capabilities of Large Language Models (LLMs). The central argument is that high performance by LLMs on medical MCQs may be an overestimation of their true medical understanding, potentially driven by factors beyond genuine knowledge and reasoning. The authors propose and utilise a novel benchmark of paired free-response and MCQ questions (FreeMedQA) to investigate this hypothesis. This study provides compelling evidence that performance on medical MCQ benchmarks may not be a reliable indicator of the true medical knowledge and reasoning abilities of LLMs. The significant performance drop in free-response questions, coupled with the above-chance MCQ accuracy even with completely masked questions, suggests that LLMs might be exploiting the structure of MCQs rather than demonstrating genuine understanding. The findings underscore the importance of developing and utilizing more rigorous evaluation methods, such as free-response questions, to accurately assess the potential and limitations of LLMs in medical applications.
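As a rough illustration of the masking probe described above, the sketch below presents only the answer options with the question stem hidden, so any above-chance accuracy would have to come from the structure of the options rather than medical understanding. The ask_llm() helper and the example options are stand-ins, not the FreeMedQA protocol itself.

    import random

    def ask_llm(prompt: str) -> str:
        # Hypothetical model call; swap in a real LLM API. Here it just guesses a letter.
        return random.choice("ABCD")

    def masked_mcq_prompt(options: list[str]) -> str:
        # The clinical vignette is hidden entirely; only the answer options remain.
        choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
        return f"[QUESTION MASKED]\n{choices}\nAnswer with a single letter."

    item = {"options": ["Amoxicillin", "Warfarin", "Metformin", "Lisinopril"], "answer": "C"}
    trials = 200
    hits = sum(ask_llm(masked_mcq_prompt(item["options"])) == item["answer"] for _ in range(trials))
    print(f"Accuracy with the question masked: {hits / trials:.0%} (chance = 25%)")
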
This document summarises the key findings and insights from the NeurIPS 2023 Large Language Model (LLM) Efficiency Fine-tuning Competition. The competition aimed to democratise access to state-of-the-art LLMs by challenging participants to fine-tune a pre-trained model within a tight 24-hour timeframe on a single GPU. The analysis of the competition reveals a significant trend towards benchmark overfitting, highlighting the limitations of current evaluation methods. Notably, top-performing submissions prioritised data curation and the use of standard open-source libraries over custom model architectures. The competition also underscored the importance of software quality and reproducibility in the machine learning community. The organisers have released all competition entries and evaluation infrastructure to facilitate further research in this area.
This briefing document reviews the main themes and important ideas presented in Krti Tallam's paper on Orchestrated Distributed Intelligence (ODI). The paper argues for a paradigm shift in the field of Agentic AI, moving away from the development of isolated autonomous agents towards the creation of integrated, orchestrated systems of agents that work collaboratively with human workflows. ODI is presented as a novel approach that combines systems theory with AI capabilities, aiming to bridge the gap between artificial and human intelligence and transition organisations from static systems of record to dynamic systems of action.
This briefing document reviews the main themes and important ideas presented in the research paper "MoonCast: High-Quality Zero-Shot Podcast Generation". The paper introduces MoonCast, a novel system designed to generate natural, multi-speaker podcast-style speech from text-only sources using the voices of unseen speakers. The key innovation lies in addressing the challenges of long speech duration and spontaneity, which are limitations of many existing text-to-speech (TTS) systems.
This paper addresses the critical challenges of aligning superhuman artificial intelligence (AI) with human values, specifically focusing on scalable oversight and the dynamic nature of these values. The authors argue that existing approaches, such as recursive reward modelling, which aim for scalable oversight, often remove humans from the alignment loop entirely, failing to account for the evolving nature of human preferences. To counter this, the paper proposes a novel algorithmic framework inspired by Iterated Amplification. This framework trains a superhuman reasoning model to decompose complex tasks into subtasks that can be evaluated and solved by aligned human-level AI. The central assumption of this approach is the "part-to-complete generalization hypothesis," which posits that the alignment of subtask solutions will generalize to the alignment of the complete solution. The paper outlines the proposed algorithm, discusses methods for measuring and improving this generalization, and reflects on how this framework addresses key challenges in AI alignment.
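A minimal sketch of the decompose-and-delegate pattern this framework describes: a stronger reasoner splits a task into subtasks small enough for an aligned, human-level model to handle, and the pieces are recombined. The function names and the fixed three-way split are illustrative assumptions, not the authors' algorithm.

    def decompose(task: str) -> list[str]:
        # Hypothetical call to a strong reasoning model that splits a task into smaller pieces.
        return [f"{task} / subtask {i}" for i in (1, 2, 3)]

    def solve_with_aligned_model(task: str) -> str:
        # Hypothetical call to an aligned, human-level model that can handle a small task.
        return f"solution<{task}>"

    def amplify(task: str, depth: int = 2) -> str:
        # Recursively break the task down until pieces are small enough to delegate,
        # then recombine the sub-solutions. The "part-to-complete" hypothesis is that
        # combining aligned sub-solutions yields an aligned overall solution.
        if depth == 0:
            return solve_with_aligned_model(task)
        return " + ".join(amplify(sub, depth - 1) for sub in decompose(task))

    print(amplify("draft a safe deployment plan"))
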
The paper concludes by highlighting the introduction of MASFT, a taxonomy of multi-agent system (MAS) failure modes, as a "structured framework for understanding and mitigating MAS failures", and the development of a "scalable LLM-as-a-judge evaluation pipeline" for diagnosing failure modes. The intervention studies reveal that addressing MAS failures requires more than just simple fixes, paving a "clear roadmap for future research" focused on structural MAS redesigns. The open-sourcing of the dataset and LLM annotator further supports future work in this area. The authors note that "despite the growing interest in LLM agents, dedicated research on their failure modes is surprisingly limited," positioning their work as a "pioneering effort in studying failure modes in MASs" and underscoring the need for further research into robust evaluation metrics, common failure patterns, and effective mitigation strategies.
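To give a feel for what an LLM-as-a-judge annotation pass over a multi-agent trace might look like, here is a hedged sketch. The failure-mode labels, the judge prompt, and the call_judge() helper are placeholders for illustration; they are not the released annotator or the paper's taxonomy.

    import json

    FAILURE_MODES = [  # illustrative label set, not the paper's taxonomy
        "task derailment", "information withholding",
        "premature termination", "missing verification",
    ]

    def call_judge(prompt: str) -> str:
        # Hypothetical judge-model call; a real pipeline would send this to an LLM API.
        return json.dumps({"labels": ["missing verification"],
                           "rationale": "The final answer was submitted without any check."})

    def annotate_trace(trace: list[dict]) -> dict:
        transcript = "\n".join(f"[{turn['agent']}] {turn['message']}" for turn in trace)
        prompt = (
            "You are auditing a multi-agent system run.\n"
            f"Select every failure mode that applies from: {FAILURE_MODES}\n"
            "Respond as JSON with keys 'labels' and 'rationale'.\n\n" + transcript
        )
        return json.loads(call_judge(prompt))

    trace = [
        {"agent": "planner", "message": "Split the work; we can skip the review step."},
        {"agent": "coder", "message": "Done. Submitting without running the tests."},
    ]
    print(annotate_trace(trace))
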
This paper addresses the challenge of automating high-stakes meta-review generation, a critical task in academic peer review that involves synthesizing conflicting evaluations and deriving consensus. The authors argue that current Large Language Model (LLM)-based methods for this task are underdeveloped and susceptible to cognitive biases like the anchoring effect and conformity bias, hindering their ability to effectively handle disagreements. To overcome these limitations, the paper introduces the Cognitive Alignment Framework (CAF), a novel dual-process architecture inspired by Kahneman's dual-process theory of human cognition. CAF employs a three-step cognitive pipeline: review initialisation, incremental integration, and cognitive alignment. Empirical validation on the PeerSum dataset demonstrates that CAF outperforms existing LLM-based methods in terms of sentiment and content consistency.
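A skeletal sketch of a three-stage pipeline in the spirit of the one named above (initialisation, incremental integration, alignment). The prompts and the llm() helper are assumptions made for illustration, not the CAF implementation.

    def llm(prompt: str) -> str:
        # Hypothetical LLM call; a real system would send the prompt to a model API.
        return f"<draft derived from: {prompt[:48]}...>"

    def initialise(first_review: str) -> str:
        # Stage 1: form an initial draft from a single review rather than the whole set at once.
        return llm("Summarise the key claims and concerns of this review:\n" + first_review)

    def integrate(draft: str, review: str) -> str:
        # Stage 2: fold in each remaining review incrementally, noting agreements and conflicts.
        return llm(f"Current meta-review draft:\n{draft}\n\nRevise it in light of this review:\n{review}")

    def align(draft: str, reviews: list[str]) -> str:
        # Stage 3: re-check the draft against all reviews to correct residual bias.
        joined = "\n---\n".join(reviews)
        return llm(f"Check this meta-review against every review and resolve conflicts:\n{draft}\n\n{joined}")

    def meta_review(reviews: list[str]) -> str:
        draft = initialise(reviews[0])
        for review in reviews[1:]:
            draft = integrate(draft, review)
        return align(draft, reviews)

    print(meta_review(["Strong results but weak baselines.", "Well written; limited novelty.", "Accept."]))
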
This briefing document summarises the key findings and implications of the research paper "A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks" by Shakya et al. The study investigates the capabilities of two leading Large Language Models (LLMs), ChatGPT o3-mini and DeepSeek-R1, in solving competitive programming problems from Codeforces. The evaluation focuses on the accuracy of solutions, memory efficiency, and runtime performance across easy, medium, and hard difficulty levels.

Study Limitations: The study acknowledges several limitations:
- Single-shot prompting: The lack of follow-up prompts might have limited the refinement of generated outputs, as "LLM-assisted programming requires human intervention to ensure correctness".
- Model versions: The study used ChatGPT o3-mini but not the more programming-focused DeepSeek-Coder, which "could have demonstrated better results than the R1".
- Limited task set: The use of only 29 programming tasks might limit the generalisability of the results.
- Single programming language: Focusing solely on C++ might limit the applicability across different coding environments.
- Prompt formulation: While a consistent prompt was used, exploring different prompts could yield further insights.

The authors suggest that future research should address these limitations by using more diverse problem sets, exploring multiple programming languages, testing different prompting strategies, and comparing more recent versions of these and other LLMs.

Key Takeaways:
- ChatGPT o3-mini demonstrates superior performance in solving medium-difficulty competitive programming tasks compared to DeepSeek-R1 in a zero-shot setting.
- Both models struggle significantly with hard programming tasks, indicating the current limitations of LLMs in handling high-complexity problems without further human guidance or advanced prompting techniques.
- ChatGPT generally exhibits better runtime performance, while DeepSeek sometimes shows lower memory consumption, though often at the cost of correctness.
- The study highlights the ongoing need for human intervention and advanced prompting strategies to effectively utilise LLMs for solving programming tasks, particularly those beyond the easy difficulty level.
- Future research should explore the impact of different prompting techniques, model versions (like DeepSeek-Coder), and a wider range of tasks and programming languages to gain a more comprehensive understanding of LLM capabilities in code generation.

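For readers curious what a single-shot evaluation of this kind involves in practice, here is a rough sketch: one prompt per task, and the returned C++ compiled and run against sample tests. The prompt wording, file handling, and ask_llm() stub are assumptions rather than the authors' harness, and g++ must be available locally.

    import pathlib
    import subprocess
    import tempfile

    def ask_llm(prompt: str) -> str:
        # Hypothetical model call returning C++ source; swap in a real LLM API.
        return "#include <iostream>\nint main(){long long a,b;std::cin>>a>>b;std::cout<<a+b;}"

    def run_single_shot(statement: str, tests: list[tuple[str, str]]) -> bool:
        source = ask_llm("Solve this competitive programming problem in C++17. "
                         "Return only the code.\n\n" + statement)
        workdir = pathlib.Path(tempfile.mkdtemp())
        (workdir / "sol.cpp").write_text(source)
        build = subprocess.run(["g++", "-O2", "-std=c++17", "sol.cpp", "-o", "sol"], cwd=workdir)
        if build.returncode != 0:
            return False
        for given, expected in tests:
            run = subprocess.run(["./sol"], cwd=workdir, input=given,
                                 capture_output=True, text=True, timeout=5)
            if run.stdout.strip() != expected.strip():
                return False
        return True

    print(run_single_shot("Read two integers and print their sum.", [("2 3", "5"), ("10 -4", "6")]))
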
This paper presents components of an AI-assisted academic writing system focused on citation recommendation and introduction generation. The authors argue that scientific writing is a crucial but challenging skill, particularly for non-native English speakers and students. They explore how AI can augment the writing process by providing relevant citation suggestions based on document context and by automatically generating structured introductions that contextualise research within existing literature. The paper includes quantitative evaluations of their system's components and qualitative research into how researchers currently incorporate citations into their workflows, revealing a demand for precise and effective AI writing assistance.
This paper introduces a new metric, the "50%-task-completion time horizon," to quantify AI capabilities by relating AI performance on tasks to the typical time humans take to complete them. The study timed domain-expert humans on a diverse set of research and software engineering tasks (RE-Bench, HCAST, and a new suite called SWAA) and evaluated the performance of 13 frontier AI models (2019-2025) on these tasks. The key finding is that the 50% time horizon of frontier AI models has been doubling approximately every seven months since 2019, potentially accelerating in 2024. Extrapolation of this trend suggests that within five years, AI systems may be capable of automating many software tasks currently taking humans a month. The paper discusses the methodology, limitations, and implications of these findings, particularly for AI safety and governance.

This paper provides a compelling new way to measure and track the progress of AI capabilities by focusing on the time horizon for task completion. The observed exponential growth, particularly the potential acceleration in recent years, has significant implications for the future of automation and AI safety. While acknowledging the limitations of current benchmarks and the challenges of extrapolating these trends to real-world scenarios, the findings suggest a rapid advancement towards AI systems capable of tackling increasingly complex and time-consuming tasks. Continued research and development of more realistic benchmarks will be crucial for accurately forecasting AI capabilities and ensuring responsible AI governance.
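The headline extrapolation is simple doubling arithmetic, sketched below. The starting horizon of one hour is an illustrative placeholder rather than a figure quoted from the paper; only the seven-month doubling period and the five-year window come from the summary above.

    # Back-of-the-envelope extrapolation of a "doubling every ~7 months" trend.
    # The starting value (a 1-hour horizon "today") is an illustrative assumption.
    start_horizon_hours = 1.0
    doubling_period_months = 7.0
    months_ahead = 5 * 12  # five years

    doublings = months_ahead / doubling_period_months          # ~8.6 doublings
    future_horizon_hours = start_horizon_hours * 2 ** doublings

    work_month_hours = 167  # roughly one person-month of work
    print(f"{doublings:.1f} doublings -> ~{future_horizon_hours:.0f} h "
          f"(~{future_horizon_hours / work_month_hours:.1f} person-months)")
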
Imagine you're getting emails or messages on the internet. Some of these messages might be from people trying to trick you - like strangers offering candy. But now, there's something new happening:

Smart Computers Making Tricky Messages:
- New computer programs (called LLMs) can write very convincing messages
- These messages might look like they're from someone you know
- It's getting harder for safety programs to spot these tricks

Safety Programs Need Help:
- Think of safety programs like security guards at school
- The old security guards are having trouble spotting the new tricks
- Scientists are teaching them new ways to keep us safe!

How We Can Stay Safe:
- Always be careful with messages from people you don't know
- Even if a message looks real, check with your parents or teachers
- Remember: if something seems too good to be true, it probably is!

The Good News:
- Scientists are using the same smart computers to make better safety tools
- They're teaching computers to spot these tricky messages
- It's like training super-smart guard dogs to protect us!

The most important thing to remember is:
- Always talk to a grown-up you trust before clicking on links in emails
- Never share personal information online
- If you're not sure about something, it's better to ask for help!

Just like how you look both ways before crossing the street, it's important to be careful and smart when using the internet!

Imagine you're trying to draw a picture, but instead of using crayons, you're using a special computer program. This program is called ComfyGI, and it's like having a super-smart art assistant! Here's what makes it special:

Making Pictures Better Automatically:
- Instead of spending lots of time trying to get your picture perfect, ComfyGI helps find the best way to make the picture you want
- It's like having a friend who knows all the best tricks to make art look amazing!

How It Works:
Think of it like playing a video game where you level up:
- The computer tries different ways to make the picture
- It keeps the ways that work better (like keeping the best cards in a card game)
- It keeps trying until the picture looks just right!

Special Tools It Uses:
It can change different things to make the picture better:
- Pick the best art-making program (like choosing the right paintbrush)
- Change how detailed the picture is
- Make the words that describe the picture better
- Use a smart helper to write better descriptions

Why It's Cool:
- Makes pictures that look way better than before
- Saves lots of time (no more trying again and again!)
- Almost everyone who tested it liked the pictures it made better

Think of ComfyGI like having a magical art assistant that knows exactly how to make your ideas come to life in a picture! It's like having an expert artist helping you, but it's all done by a smart computer.

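Behind the playful description is a simple search loop: change the settings a little, keep the change if the image scores better. The sketch below assumes a generic score() judge and a flat dictionary of settings; both are stand-ins, not ComfyGI's actual workflow representation.

    import random

    def score(settings: dict) -> float:
        # Hypothetical image-quality judge; ComfyGI would render an image and rate it here.
        return -abs(settings["steps"] - 30) - abs(settings["cfg_scale"] - 7.0)

    def mutate(settings: dict) -> dict:
        # Nudge one randomly chosen setting, like trying a slightly different brushstroke.
        child = dict(settings)
        key = random.choice(list(child))
        child[key] += random.choice([-2, -1, 1, 2])
        return child

    best = {"steps": 10, "cfg_scale": 2}
    best_score = score(best)
    for _ in range(300):
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep only the changes that make the "picture" better
            best, best_score = candidate, candidate_score

    print(best, best_score)
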
Imagine you're trying to decide how to spend or save your pocket money. Sometimes, the way someone tells you about something can change how you feel about it - just like how a boring vegetable might sound tastier if it's described in a fun way! This is called the "framing effect."

Now, scientists are trying to teach a smart computer program (called ChatGPT) to make better decisions about money, especially when it comes to buying and selling gold. Here's what they did:

The Problem:
- People (and computers) sometimes make quick decisions based on how information sounds
- They might get too excited about good news or too worried about bad news
- This can lead to not-so-smart choices with money

The Solution - A Special Three-Step Plan:
- First: Sort the news into different groups (like sorting your toys into different boxes)
- Second: Give each piece of news a score (like rating games out of 10)
- Third: Think about it again carefully (like when your parents tell you to "think twice")

How It Works:
It's like having a super-smart friend who:
- Doesn't get too excited by flashy news
- Thinks about the long-term (not just right now)
- Double-checks its decisions to make sure they're smart

The Results:
- This new way of thinking helped the computer make better money decisions
- It did better than just buying and keeping gold
- It was smarter than simpler ways of making decisions

Think of it like teaching a computer to be a wise money manager - one that doesn't get tricked by how things sound and always takes time to think carefully before making decisions!

Imagine we're talking about making sure robots and smart computers (AI) are good helpers for everyone in the world. Here's what smart people are thinking about:

Making AI Be a Good Helper:
- Like when you build with LEGO, we need to think carefully about how we build AI
- We want AI to make good decisions and be able to explain why it made them
- Just like we care about our planet, we need to make sure AI doesn't waste too much energy

How AI Affects People:
- Some people worry AI might take away jobs from humans
- We need to make sure AI is fair to everyone (like how teachers should treat all students fairly)
- We need to protect people's secrets (like how you wouldn't want someone reading your diary)

Different Countries Working with AI:
- Countries like the USA and China are in a friendly race to build the best AI
- Some places have more tools to build AI than others (like how some schools might have more computers)
- People want to work together to make AI helpful for everyone in the world

Cool Things Happening:
- In Montreal (a city in Canada), lots of people are working together to build smart AI
- People have special events called "hackathons" where they work together to solve problems with computers
- Scientists are finding new ways to make AI better and safer

The big idea is that we want AI to be like a really good friend:
- Helpful and kind to everyone
- Honest about what it's doing
- Careful with people's private information
- Good for our planet

Just like how you have rules at home and school to help everyone get along, we need rules to make sure AI helps make the world better for everyone!

Imagine we're talking about making sure robots (or AI) are good friends that we can trust. Scientists are trying to figure out how to teach these AI helpers to be honest, fair, and kind - just like how your parents and teachers teach you good values! Here's what the scientists are working on:

Making Trustworthy AI Friends:
- They want to create AI that always tells the truth
- These AI should be fair to everyone
- They should be clear about what they're doing (no keeping secrets!)

Team of AI Helpers:
- Instead of using just one AI, they're creating teams of AI
- Each AI has a special job (like how different players in a sports team have different roles)
- These AI work together and check each other's work (like how students might check each other's homework)

Teaching AI to Be Good:
- Scientists are creating special rules and guidelines
- It's like having a rulebook that helps the AI know what's right and wrong
- They want to make sure the AI follows important rules, just like how we follow rules at school

Challenges They Face:
- Sometimes it's hard to explain complicated rules to AI
- They need to make sure the AI can work well with human programmers
- They want to make sure the AI stays up-to-date with the newest rules

Think of it like training a super-smart robot pet:
- You want it to be helpful and friendly
- It needs to learn right from wrong
- It should work well with other robot pets
- Most importantly, it needs to be someone you can trust!

The scientists found out that when AI work as a team and talk to each other about what's right and wrong, they make better decisions - just like how you might make better decisions when you talk things through with your friends or family!

Imagine you have a super-smart computer friend called Hugging Face that helps you understand and work with words and languages. Here's what it can do:

Cleaning Up Text:
- Just like how you clean up your room, Hugging Face helps clean up text
- It removes messy stuff (like weird symbols or extra spaces)
- It makes all the words neat and organized, like arranging your toys!

Understanding Words:
- Hugging Face breaks down big sentences into smaller pieces (like breaking LEGO sets into individual blocks)
- It understands different languages (like having a friend who can speak many languages!)
- It can even learn new words it hasn't seen before

Cool Things It Can Do:
- Read stories and tell you what they're about (like giving you a quick summary of a book)
- Tell if someone is happy or sad from their writing (like understanding emoji meanings!)
- Answer questions (like having a smart friend who helps with homework)
- Translate languages (like having a universal translator from sci-fi movies!)
- Write new text (like having a creative writing buddy)

Learning and Getting Better:
- Just like how you learn new things at school, Hugging Face can learn to get better at specific tasks
- It can practice with different types of writing (like sports news or science books)
- The more it practices, the better it gets!

Sharing with Others:
- Scientists and developers can share their trained Hugging Face models with others
- It's like sharing your toys with friends, but with smart computer programs!
- Everyone can work together to make these programs better

Think of Hugging Face as a friendly robot librarian who's really good at reading, writing, and understanding different languages. It helps people work with text in all sorts of fun and useful ways!

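In grown-up terms, several of these abilities are exposed through the transformers library's pipeline interface. The snippet below shows two of the tasks mentioned (sentiment analysis and summarisation); the default models are whatever the library selects and are downloaded on first use.

    # Requires: pip install transformers torch  (default models download on first use)
    from transformers import pipeline

    # Sentiment analysis: "tell if someone is happy or sad from their writing".
    sentiment = pipeline("sentiment-analysis")
    print(sentiment("I love how easy this library is to use!"))

    # Summarisation: "read stories and tell you what they're about".
    summariser = pipeline("summarization")
    article = ("Hugging Face provides pretrained models for summarisation, translation, "
               "question answering and text generation, and lets researchers share their "
               "fine-tuned models with the wider community.")
    print(summariser(article, max_length=30, min_length=10)[0]["summary_text"])
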
Imagine scientists are trying to create special computer friends (they call them "generative agents") that can act just like real people! Here's what they did:

Making Computer Friends:
- The scientists talked to 1,000 real people for 2 hours each
- They used these conversations to teach computers how to think and respond like those real people
- It's kind of like making digital twins of real people!

How It Works:
Think of it like creating characters in a video game, but these characters:
- Remember things from their conversations
- Can think about things like real people
- Can answer questions just like the real person would
- Can make decisions like real people do

Testing the Computer Friends:
The scientists gave these computer friends different tests to see how well they could act like real people. They asked them questions about:
- What they think about different things in life
- Their personality
- How they would share things with others
- How they would make decisions

Why This is Cool:
Scientists can use these computer friends to:
- Learn about how people behave
- Test their ideas without bothering real people all the time
- Understand how different people might react in different situations

Think of it like having a really smart toy that can pretend to be a real person - it remembers things, makes decisions, and acts just like the person it learned from! It's like having thousands of digital actors who can help scientists understand how people think and behave.

But remember - these are still just computer programs, not real people. The scientists are very careful to use this technology safely and protect the privacy of the real people they talked to.