📄 arXiv Latest Papers (29 items)
👤 Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso et al.
📂 cs.CL
Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human pe...
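To make the "insensitive to meaning" point concrete, here is a minimal sketch of the standard WER metric the abstract refers to (word-level edit distance normalized by reference length); this is an illustrative implementation, not code from the paper:

```python
# Word Error Rate (WER): minimum edit distance between reference and
# hypothesis word sequences, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# WER is blind to meaning: a synonym and a nonsense word are both
# counted as one substitution.
print(wer("the cat sat", "the feline sat"))  # one substitution -> 1/3
print(wer("the cat sat", "the xqzt sat"))    # nonsense word -> same 1/3
```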
👤 Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik
📂 cs.CL, cs.SE
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they ca...
👤 Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas et al.
📂 cs.AI
Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research ques...
👤 Chee Wei Tan, Yuchen Wang, Shangxin Guo
📂 cs.AI
This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this p...
👤 Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi
📂 cs.CL
Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore...
👤 Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu et al.
📂 cs.CL, cs.AI, cs.LG
Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user ...
👤 Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii et al.
📂 cs.CL
Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing n...
👤 Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong et al.
📂 cs.CL
Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they enco...
👤 Naheed Rayhan, Sohely Jahan
📂 cs.CR, cs.AI
Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new ...
👤 Haolin Zhang, William Reber, Yuxuan Zhang, Guofei Gu et al.
📂 cs.CR, cs.AI
Modern phishing campaigns increasingly evade snapshot-based URL classifiers using interaction gates (e.g., checkbox/slider challenges), delayed content rendering, and logo-less credential harvesters. ...
👤 Anuj Sadani, Deepak Kumar
📂 cs.AI
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidde...
👤 Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan et al.
📂 cs.AI, cs.CL, cs.MA
Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communica...
👤 Hans Ole Hatzel, Ekaterina Artemova, Haimo Paul Stiemer, Evelyn Gius et al.
📂 cs.CL
We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification pr...
👤 Minji Jung, Minjae Lee, Yejin Kim, Sarang Choi et al.
📂 cs.AI, cs.CY, cs.HC
LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the dive...
👤 Guangxiang Zhao, Qilong Shi, Xusen Xiao, Xiangzheng Zhang et al.
📂 cs.AI
Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills dist...
👤 Joseba Fernandez de Landa, Carla Perez-Almendros, Jose Camacho-Collados
📂 cs.CL, cs.AI, cs.CY
LLMs have been showing limitations when it comes to cultural coverage and competence, and in some cases show regional biases such as amplifying Western and Anglocentric viewpoints. While there have be...
👤 Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong et al.
📂 cs.CL, cs.AI, cs.IR, cs.LG, cs.MA
Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approa...
👤 Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei et al.
📂 cs.CL, cs.AI, cs.CE
LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experie...
👤 Minh Duc Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager et al.
📂 cs.CL, cs.SE
Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bia...
👤 Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang et al.
📂 cs.CR, cs.AI, cs.CL
The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor at...
👤 Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou et al.
📂 cs.AI, cs.CL
Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a criti...
👤 Lester James V. Miranda, Songbo Hu, Roi Reichart, Anna Korhonen
📂 cs.CL, cs.CY
Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardwa...
👤 Hao-Yuan Chen
📂 cs.CL, cs.AI
Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision...
👤 Kaushitha Silva, Srinath Perera
📂 cs.SE, cs.AI
Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally co...
👤 Linjuan Wu, Haoran Wei, Jialong Tang, Shuang Luo et al.
📂 cs.CL
As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that...
👤 Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu et al.
📂 cs.AI, cs.CE, cs.LG
Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and...
👤 Milan De Koning, Ali Asgari, Pouria Derakhshanfar, Annibale Panichella
📂 cs.SE, cs.AI
LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may m...
👤 Chris Schneider, Philipp Schoenegger, Ben Bariach
📂 cs.AI, cs.LG
Current model training approaches incorporate user information directly into shared weights, making individual data removal computationally infeasible without retraining. This paper presents a three-l...
👤 Rodrigo Nogueira, Giovana Kerche Bonás, Thales Sales Almeida, Andrea Roque et al.
📂 cs.CL
Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions abou...
🤗 HuggingFace Trending Papers (20 items)
👤 Yueyang Ding, HaoPeng Zhang, Rui Dai · 👍 75
Comprehensive understanding of time series remains a significant challenge for Large Language Models (LLMs). Current research is hindered by fragmented task definitions and benchmarks with inherent am...
👤 Kanzhi Cheng, Zehao Li, Zheng Ma · 👍 27
Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% s...
👤 Chaitanya Dwivedi, Binxuan Huang, Himanshu Gupta · 👍 15
Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert ro...
👤 Yen-Siang Wu, Rundong Luo, Jingsen Zhu · 👍 13
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention ...
👤 Xiyang Wu, Zongxia Li, Guangyao Shi · 👍 10
Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, an...
👤 Wenhong Zhu, Ruobing Xie, Rui Wang · 👍 9
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and ...
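The "divergence direction" choice mentioned above can be illustrated with a toy example: forward KL(teacher || student) is mass-covering, while reverse KL(student || teacher) is mode-seeking. The distributions below are hypothetical and not from the paper:

```python
import math

# KL divergence over a small discrete vocabulary.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]  # peaked teacher distribution
student = [0.4, 0.3, 0.3]  # flatter student distribution

# Forward KL penalizes the student for missing teacher mass (mass-covering);
# reverse KL penalizes the student for placing mass where the teacher has
# little (mode-seeking). The two objectives pull training differently.
forward = kl(teacher, student)
reverse = kl(student, teacher)
print(forward, reverse)
```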
👤 Jun Wang, Ziyin Zhang, Rui Wang · 👍 9
Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user ...
👤 Skylar Zhai, Jingcheng Liang, Dongyeop Kang · 👍 8
Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Exist...
👤 Valentin Gabeur, Shangbang Long, Songyou Peng · 👍 7
Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoni...
👤 Qijun Han, Haoqin Tu, Zijun Wang · 👍 5
Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same f...
👤 Ceyuan Yang, Zhijie Lin, Yang Zhao · 👍 5
We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context ...
👤 Juyong Jiang, Chenglin Cai, Chansung Park · 👍 3
While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Exis...
👤 Yanran Zhang, Wenzhao Zheng, Yifei Li · 👍 3
In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved dis...
👤 Hardy Chen, Nancy Lau, Haoqin Tu · 👍 3
Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a public evaluation file...
👤 Vipula Rawte, Ryan Rossi, Franck Dernoncourt · 👍 2
Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual inaccuracies and hallucinations. This limitation po...
👤 Noah Flynn · 👍 2
Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To...
👤 Benjamin K. Johnson, Thomas Goralski, Ayush Semwal · 👍 2
Semi-Markov Conditional Random Fields (semi-CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference over segment-level features and principled uncer...
👤 Yao Zhang, Zhuchenyang Liu, Thomas Ploetz · 👍 1
The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answeri...
👤 Jaechul Roh, Amir Houmansadr · 👍 1
Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples caus...
👤 Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina · 👍 0
Personalizing language models by effectively incorporating user interaction history remains a central challenge in the development of adaptive AI systems. While large language models (LLMs), combined ...
🔥 GitHub Trending Projects (7 items)
⭐ 8,845 (+706 today) · TypeScript
Code search MCP for Claude Code. Make entire codebase the context for any coding agent.
⭐ 2,821 (+228 today) · Python
Build your own AI SRE agents. The open source toolkit for the AI era ✨
⭐ 107,319 (+203 today) · Python
100+ AI Agent & RAG apps you can actually run — clone, customize, ship.
⭐ 31,240 (+188 today) · Python
LLM-powered stock analysis system for A-share/H-share/US markets: multi-source market data + real-time news + an LLM decision dashboard + multi-channel push notifications; runs on a schedule at zero cost using only free services.
⭐ 7,772 (+49 today) · Python
An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.
⭐ 25,535 (+20 today) · Python
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
⭐ 8,965 (+17 today) · Python
A collection of sample agents built with Agent Development Kit (ADK)