Source

research · en · weight 1.2

  1. arXiv cs.CL/arxiv.org/
    IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language

    arXiv:2604.16654v1 Announce Type: new Abstract: Reclaimed slur usage is a common and meaningful practice online for many marginalized communities. It serves as a source of solidarity, identity, and shared experience. However, contemporary automated and AI-based moderation tools for online content largely fail to distinguish between reclaimed and hateful uses of slurs, resulting in the suppression…

    1.1#cs.cl
  2. arXiv cs.CL/arxiv.org/
    Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning

    arXiv:2604.16378v1 Announce Type: new Abstract: Large language models (LLMs) and classical machine learning methods offer complementary strengths for predictive modeling, yet their fundamentally different representations and training paradigms hinder effective integration: LLMs rely on gradient-based optimization over textual data, whereas models such as Random Forests (RF) employ non-differentia…

    1.4#cs.cl#cs.lg
  3. arXiv cs.CL/arxiv.org/
    Data Mixing for Large Language Models Pretraining: A Survey and Outlook

    arXiv:2604.16380v1 Announce Type: new Abstract: Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget constraints. Unlike sample-level data selection, data mixing optimizes domain-level sampling weights to allocate limi…

    1.4#cs.cl#cs.lg
  4. arXiv cs.CL/arxiv.org/
    Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

    arXiv:2604.16368v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM frame…

    1.4#cs.cl
  5. arXiv cs.CL/arxiv.org/
    AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

    arXiv:2604.16625v1 Announce Type: new Abstract: Recent large language model (LLM) agents have shown promise in using execution feedback for test-time adaptation. However, robust self-improvement remains far from solved: most approaches still treat each problem instance independently, without accumulating reusable knowledge. This limitation is particularly pronounced in domain-specific languages s…

    1.4#cs.cl#cs.ai
  6. arXiv cs.CL/arxiv.org/
    Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning

    arXiv:2604.16622v1 Announce Type: new Abstract: Backchannels (e.g., "yeah", "mhm", and "right") are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage …

    1.4#cs.cl#cs.ai
  7. arXiv cs.CL/arxiv.org/
    Spotlights and Blindspots: Evaluation Machine-Generated Text Detection

    arXiv:2604.16607v1 Announce Type: new Abstract: With the rise of generative language models, machine-generated text detection has become a critical challenge. A wide variety of models is available, but inconsistent datasets, evaluation metrics, and assessment strategies obscure comparisons of model effectiveness. To address this, we evaluate 15 different detection models from six distinct systems…

    1.1#cs.cl#cs.ai
  8. arXiv cs.CL/arxiv.org/
    Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

    arXiv:2604.16593v1 Announce Type: new Abstract: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiom…

    1.1#cs.cl
  9. arXiv cs.CL/arxiv.org/
    SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future

    arXiv:2604.16451v1 Announce Type: new Abstract: Recent advances in visual-language models (VLMs) have led to significant improvements in a plethora of complex multimodal tasks like image captioning, report generation, and visual perception. However, generating text from meteorological data is highly challenging because the atmosphere is a chaotic system that is rapidly changing at various spatial…

    1.1#cs.cl#cs.cv
  10. arXiv cs.CL/arxiv.org/
    HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

    arXiv:2604.16430v1 Announce Type: new Abstract: Large Language Models (LLMs) are powerful and widely adopted, but their practical impact is limited by the well-known hallucination phenomenon. While recent hallucination detection methods have made notable progress, we find most of them overlook its dynamic nature and underlying mechanisms. To address this gap, we propose HalluSAE, a phase tr…

    1.4#cs.cl#cs.ai
  11. arXiv cs.CL/arxiv.org/
    Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG

    arXiv:2604.16422v1 Announce Type: new Abstract: The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) Continual pretraining that embeds kn…

    1.1#cs.cl#cs.ai
  12. arXiv cs.CL/arxiv.org/
    Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

    arXiv:2604.16376v1 Announce Type: new Abstract: This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and comp…

    1.1#cs.cl#cs.cr
  13. arXiv cs.CL/arxiv.org/
    CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

    arXiv:2604.16372v1 Announce Type: new Abstract: Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social …

    1.1#cs.cl#cs.ai
  14. arXiv cs.CL/arxiv.org/
    Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction

    arXiv:2604.16370v1 Announce Type: new Abstract: Decoding natural language from non-invasive electroencephalography (EEG) remains fundamentally limited by low signal-to-noise ratio and restricted information bandwidth. This raises a fundamental question regarding whether sentence-level linguistic structure can be reliably recovered from such signals. In this work, we suggest that this assumption m…

    1.1#cs.cl#cs.ai
  15. arXiv cs.CL/arxiv.org/
    Multimodal Claim Extraction for Fact-Checking

    arXiv:2604.16311v1 Announce Type: new Abstract: Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today's misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studie…

    1.1#cs.cl#cs.ai
  16. arXiv cs.CL/arxiv.org/
    A Community-Based Approach for Stance Distribution and Argument Organization

    arXiv:2604.16852v1 Announce Type: new Abstract: The proliferation of online debate platforms and social media has led to an unprecedented volume of argumentative content on controversial topics from multiple perspectives. While this wealth of perspectives offers opportunities for developing critical thinking and breaking filter bubbles (Pariser 2011), the sheer volume and complexity of arguments …

    1.1#cs.cl
  17. arXiv cs.CL/arxiv.org/
    DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

    arXiv:2604.16845v1 Announce Type: new Abstract: Large language models (LLMs) tuned for safety often avoid acknowledging demographic differences, even when such acknowledgment is factually correct (e.g., ancestry-based disease incidence) or contextually justified (e.g., religious hiring preferences). This identity-blindness yields incorrect responses, unnecessary refusals, or generic "equal-treatm…

    1.4#cs.cl
  18. arXiv cs.CL/arxiv.org/
    Detecting Alarming Student Verbal Responses using Text and Audio Classifier

    arXiv:2604.16717v1 Announce Type: new Abstract: This paper addresses a critical safety gap in the use of Automated Verbal Response Scoring (AVRS). We present a novel hybrid framework for troubled student detection that combines a text classifier, trained to detect responses based on their content, and an audio classifier, trained to detect responses using prosodic markers. This approach overcomes ke…

    1.1#cs.cl#cs.ir
  19. arXiv cs.CL/arxiv.org/
    x1: Learning to Think Adaptively Across Languages and Cultures

    arXiv:2604.16917v1 Announce Type: new Abstract: Languages encode distinct abstractions and inductive priors, yet most large language models (LLMs) overlook this diversity by reasoning in a single dominant language. In this work, we introduce x1, a family of reasoning models that can adaptively reason in an advantageous language on a per-instance basis. To isolate the effect of reasoning-language …

    1.4#cs.cl
  20. arXiv cs.CL/arxiv.org/
    When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

    arXiv:2604.16916v1 Announce Type: new Abstract: Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a s…

    1.4#cs.cl
  21. arXiv cs.CL/arxiv.org/
    Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation

    arXiv:2604.16881v1 Announce Type: new Abstract: Cross-cultural entity translation remains challenging for large language models (LLMs) as literal or phonetic renderings are usually yielded instead of culturally appropriate translations in context. However, relevant knowledge may already be encoded in model parameters during large-scale pre-training. To incentivize the effective use of parametric …

    1.4#cs.cl#cs.ai
  22. arXiv cs.CL/arxiv.org/
    PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations

    arXiv:2604.16909v1 Announce Type: new Abstract: As large language models (LLMs) evolve from conversational assistants into agents capable of handling complex tasks, they are increasingly deployed in high-risk domains. However, existing benchmarks largely rely on mixed queries and posterior evaluation, output-level scoring, which quantifies hallucination severity but offers limited insight into wh…

    1.4#cs.cl#cs.ai
  23. arXiv cs.CL/arxiv.org/
    Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

    arXiv:2604.16889v1 Announce Type: new Abstract: Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior, with the rest resulting in expensive feature explaining and evaluating costs. We introduce the first CLT-native end-to-end framework, PIE, connecting Pruning, automat…

    1.1#cs.cl
  24. arXiv cs.CL/arxiv.org/
    HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents

    arXiv:2604.16839v1 Announce Type: new Abstract: Long-term memory is a critical challenge for Large Language Model agents, as fixed context windows cannot preserve coherence across extended interactions. Existing memory systems represent conversation history as unstructured embedding vectors, retrieving information through semantic similarity. This paradigm fails to capture the associative structu…

    1.4#cs.cl
  25. arXiv cs.CL/arxiv.org/
    Crowded in B-Space: Calibrating Shared Directions for LoRA Merging

    arXiv:2604.16826v1 Announce Type: new Abstract: Merging separately trained LoRA adapters is a practical alternative to joint multi-task training, but it often hurts performance. Existing methods usually treat the LoRA update $\Delta W = BA$ as a single object and do not distinguish the two LoRA matrices. We show that the main source of LoRA merge interference comes from the output-side matrix $B$…

    1.1#cs.cl
  26. arXiv cs.CL/arxiv.org/
    When Informal Text Breaks NLI: Tokenization Failure, Distribution Shift, and Targeted Mitigations

    arXiv:2604.16787v1 Announce Type: new Abstract: We study how informal surface forms degrade NLI accuracy in ELECTRA-small (14M) and RoBERTa-large (355M) across four transforms applied to SNLI and MultiNLI: slang substitution, emoji replacement, Gen-Z filler tokens, and their combination. Slang substitution (replacing formal words with informal equivalents, e.g., "going to" -> "gonna", "friend" ->…

    1.1#cs.cl#cs.ai
  27. arXiv cs.CL/arxiv.org/
    StageMem: Lifecycle-Managed Memory for Language Models

    arXiv:2604.16774v1 Announce Type: new Abstract: Long-horizon language model systems increasingly rely on persistent memory, yet many current designs still treat memory primarily as a static store: write an item, place it into memory, and retrieve it later if needed. We argue that this framing does not adequately capture the practical memory-control problem in deployed LLM systems. In realistic se…

    1.4#cs.cl#cs.ai
  28. arXiv cs.CL/arxiv.org/
    When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms

    arXiv:2604.16767v1 Announce Type: new Abstract: Audio platforms have evolved beyond entertainment. They have become central to public discourse, from podcasts and radio to WhatsApp voice notes and live streams. With millions of shows and hundreds of millions of listeners, audio platforms are now a major channel for misinformation. Yet existing fact-checking pipelines are mostly designed for writt…

    1.1#cs.cl#cs.cy
  29. arXiv cs.CL/arxiv.org/
    Expressing Social Emotions: Misalignment Between LLMs and Human Cultural Emotion Norms

    arXiv:2604.16757v1 Announce Type: new Abstract: The expression of emotions that serve social purposes, such as asserting independence or fostering interdependence, is central to human interactions and varies systematically across cultures. As LLMs are increasingly used to simulate human behavior in culturally nuanced interactions, it is important to understand whether they faithfully capture huma…

    1.4#cs.cl#cs.cy
  30. arXiv cs.CL/arxiv.org/
    Evaluating Adaptive Personalization of Educational Readings with Simulated Learners

    arXiv:2604.16744v1 Announce Type: new Abstract: We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned readin…

    1.1#cs.cl#cs.ai
  31. arXiv cs.CL/arxiv.org/
    No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation

    arXiv:2604.16686v1 Announce Type: new Abstract: Large language models (LLMs) can answer questions and summarize documents when conditioned on external contexts (e.g., retrieved evidence), yet context use remains unreliable: models may overwrite an already-correct output (neutral regression) even when the context is non-informative. We formalize neutral regression as a do-no-harm requirement and q…

    1.4#cs.cl#cs.ai
  32. arXiv cs.CL/arxiv.org/
    CBRS: Cognitive Blood Request System with Bilingual Dataset and Dual-Layer Filtering for Multi-Platform Social Streams

    arXiv:2604.16665v1 Announce Type: new Abstract: Urgent blood donation seeking posts and messages on social media often go unnoticed due to the overwhelming volume of daily communications. Traditional app-based systems, reliant on manual input, struggle to reach users in low-resource settings, delaying critical responses. To address this, we introduce the Cognitive Blood Request System (CBRS), a m…

    1.1#cs.cl
  33. arXiv cs.CL/arxiv.org/
    Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion

    arXiv:2604.16656v1 Announce Type: new Abstract: All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictate the cost and latency of access to contemporary LLMs. However, many languages written in non-Latin scripts observe a poor exchange rate: LLMs take several multiples of tokens to encode the same information in many langu…

    1.4#cs.cl
  34. arXiv cs.CL/arxiv.org/
    Migrant Voices, Local News: Insights on Bridging Community Needs with Media Content

    arXiv:2604.16651v1 Announce Type: new Abstract: Research shows news consumption differs across demographics, yet little is known about non-mainstream audiences, especially in relation to local media. Our study addresses this gap by examining how French-speaking migrants in a mid-size European city engage with local news, and whether their needs are reflected in coverage. Eight community members p…

    1.1#cs.cl
  35. arXiv cs.CL/arxiv.org/
    EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions

    arXiv:2604.16456v1 Announce Type: new Abstract: Real-time voice assistants must revise task state when users interrupt mid-response, but existing spoken-dialog benchmarks largely evaluate turn-based interaction and miss this failure mode. We introduce EchoChain, a controlled benchmark for evaluating full-duplex state-update reasoning under mid-speech interruptions. EchoChain identifies three recu…

    1.1#cs.cl#cs.ai
  36. arXiv cs.CL/arxiv.org/
    Measuring Representation Robustness in Large Language Models for Geometry

    arXiv:2604.16421v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representati…

    1.4#cs.cl#cs.ai
  37. arXiv cs.CL/arxiv.org/
    QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning

    arXiv:2604.16396v1 Announce Type: new Abstract: Islamic inheritance law (ilm al-mawarith) presents a challenging domain for evaluating large language models' structured reasoning capabilities, requiring multi-step legal analysis, rule-based blocking decisions, and precise fractional calculations. We present QU-NLP's submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasonin…

    1.1#cs.cl
  38. arXiv cs.CL/arxiv.org/
    The impact of postediting on AI generative translation in Yemeni context: Translating literary prose by ChatGPT

    arXiv:2604.16704v1 Announce Type: new Abstract: This study examines the role of artificial intelligence in translation, focusing on ChatGPT, specifically ChatGPT-4, and the extent to which human postediting is required in literary translation. A mixed-method approach was adopted, involving 30 professional translators who evaluated and postedited AI-generated translations of selected Arabic and En…

    1.4#cs.cl#cs.ai
  39. arXiv cs.CL/arxiv.org/
    LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?

    arXiv:2604.16382v1 Announce Type: new Abstract: Longitudinal NLP tasks require reasoning over temporally ordered text to detect persistence and change in human behavior and opinions. However, in-context learning with large language models struggles on tasks where models must integrate historical context, track evolving interactions, and handle rare change events. We introduce LiFT, a longitudinal…

    1.1#cs.cl
  40. arXiv cs.CL/arxiv.org/
    GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution

    arXiv:2604.16377v1 Announce Type: new Abstract: Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: 'Who (or which LLM) wrote this piece of code?' We present…

    1.4#cs.cl#cs.cy
  41. arXiv cs.CL/arxiv.org/
    CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents

    arXiv:2604.15802v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstru…

    0.9#cs.cl
  42. arXiv cs.CL/arxiv.org/
    Why Fine-Tuning Encourages Hallucinations and How to Fix It

    arXiv:2604.15574v1 Announce Type: new Abstract: Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using …

    0.7#cs.cl#cs.ai
  43. arXiv cs.CL/arxiv.org/
    C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

    arXiv:2604.15675v1 Announce Type: new Abstract: Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating…

    0.9#cs.cl
  44. arXiv cs.CL/arxiv.org/
    HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

    arXiv:2604.15648v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understan…

    0.7#cs.cl#cs.cv
  45. arXiv cs.CL/arxiv.org/
    Applied Explainability for Large Language Models: A Comparative Study

    arXiv:2604.15371v1 Announce Type: new Abstract: Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques…

    0.9#cs.cl#cs.ai
  46. arXiv cs.CL/arxiv.org/
    Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch

    arXiv:2604.15490v1 Announce Type: new Abstract: Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-…

    0.7#cs.cl
  47. arXiv cs.CL/arxiv.org/
    PolicyBank: Evolving Policy Understanding for LLM Agents

    arXiv:2604.15505v1 Announce Type: new Abstract: LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve it…

    0.9#cs.cl#cs.ai
  48. arXiv cs.CL/arxiv.org/
    LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

    arXiv:2604.15589v1 Announce Type: new Abstract: Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors …

    0.9#cs.cl#cs.ai
  49. arXiv cs.CL/arxiv.org/
    DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation

    arXiv:2604.15593v1 Announce Type: new Abstract: Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it fir…

    0.7#cs.cl#cs.ai
  50. arXiv cs.CL/arxiv.org/
    LLMs Corrupt Your Documents When You Delegate

    arXiv:2604.15597v1 Announce Type: new Abstract: Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems i…

    0.9#cs.cl#cs.hc
  51. arXiv cs.CL/arxiv.org/
    FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use

    arXiv:2604.15646v1 Announce Type: new Abstract: Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decompos…

    0.9#cs.cl
  52. arXiv cs.CL/arxiv.org/
    CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics

    arXiv:2604.15647v1 Announce Type: new Abstract: Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize C…

    0.7#cs.cl
  53. arXiv cs.CL/arxiv.org/
    Preference Estimation via Opponent Modeling in Multi-Agent Negotiation

    arXiv:2604.15687v1 Announce Type: new Abstract: Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLM…

    0.9#cs.cl
  54. arXiv cs.CL/arxiv.org/
    Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

    arXiv:2604.15701v1 Announce Type: new Abstract: The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explo…

    0.7#cs.cl
  55. arXiv cs.CL/arxiv.org/
    Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences

    arXiv:2604.15503v1 Announce Type: new Abstract: Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS) -- predicting fMRI activations during reading from LM activations -- have been used to argue for a high degree of similarity. To underst…

    0.7#cs.cl
  56. arXiv cs.CL/arxiv.org/
    GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

    arXiv:2604.15715v1 Announce Type: new Abstract: The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose G…

    0.7#cs.cl#cs.ai
  57. arXiv cs.CL/arxiv.org/
    Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

    arXiv:2604.15741v1 Announce Type: new Abstract: Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens…

    0.9#cs.cl#cs.ai
  58. arXiv cs.CL/arxiv.org/
    Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

    arXiv:2604.15744v1 Announce Type: new Abstract: This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on u…

    0.7#cs.cl
  59. arXiv cs.CL/arxiv.org/
    TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models

    arXiv:2604.15756v1 Announce Type: new Abstract: Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic sp…

    0.7#cs.cl#cs.cv
  60. arXiv cs.CL/arxiv.org/
    Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

    arXiv:2604.15771v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence…

    0.7#cs.cl
  61. arXiv cs.CL/arxiv.org/
    MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

    arXiv:2604.15774v1 Announce Type: new Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevoluti…

    0.9#cs.cl
  62. arXiv cs.CL/arxiv.org/
    PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

    arXiv:2604.15776v1 Announce Type: new Abstract: We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly availabl…

    0.7#cs.cl#cs.ai
  63. arXiv cs.CL/arxiv.org/
    Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)

    arXiv:2604.15547v1 Announce Type: new Abstract: The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs' inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern …

    0.9#cs.cl#cs.ai
  64. arXiv cs.CL/arxiv.org/
    Qwen3.5-Omni Technical Report

    arXiv:2604.15804v1 Announce Type: new Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of …

    0.9#cs.cl#eess.as
  65. arXiv cs.CL/arxiv.org/
    CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

    arXiv:2604.15840v1 Announce Type: new Abstract: Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-…

    0.9#cs.cl
  66. arXiv cs.CL/arxiv.org/
    Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language

    arXiv:2604.15841v1 Announce Type: new Abstract: While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP task…

    0.9#cs.cl
  67. arXiv cs.CL/arxiv.org/
    Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

    arXiv:2604.15842v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task exe…

    0.9#cs.cl
  68. arXiv cs.CL/arxiv.org/
    CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

    arXiv:2604.15847v1 Announce Type: new Abstract: Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to …

    0.7#cs.cl
  69. arXiv cs.CL/arxiv.org/
    DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition

    arXiv:2604.15866v1 Announce Type: new Abstract: Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inc…

    0.9#cs.cl#cs.ai
  70. arXiv cs.CL/arxiv.org/
    How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

    arXiv:2604.15873v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role al…

    0.9#cs.cl
  71. arXiv cs.CL/arxiv.org/
    MUSCAT: MUltilingual, SCientific ConversATion Benchmark

    arXiv:2604.15929v1 Announce Type: new Abstract: The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: Handling mixed multilingual input, specific vocabulary, and code…

    0.7#cs.cl
  72. arXiv cs.CL/arxiv.org/
    RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

    arXiv:2604.15945v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typical…

    0.9#cs.cl#cs.lg
  73. arXiv cs.CL/arxiv.org/
    SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification

    arXiv:2604.15998v1 Announce Type: new Abstract: Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck, the…

    0.7#cs.cl
  74. arXiv cs.CL/arxiv.org/
    AgentV-RL: Scaling Reward Modeling with Agentic Verifier

    arXiv:2604.16004v1 Announce Type: new Abstract: Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge…

    0.9#cs.cl#cs.ai
  75. arXiv cs.CL/arxiv.org/
    The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

    arXiv:2604.15702v1 Announce Type: new Abstract: We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive…

    0.9#cs.cl#cs.lg
  76. arXiv cs.CL/arxiv.org/
    A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

    arXiv:2604.15789v1 Announce Type: new Abstract: As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-fre…

    0.9#cs.cl
  77. arXiv cs.CL/arxiv.org/
    "Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

    arXiv:2604.15588v1 Announce Type: new Abstract: The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabSc…

    0.9#cs.cl#cs.ai
  78. arXiv cs.CL/arxiv.org/
    GroupDPO: Memory efficient Group-wise Direct Preference Optimization

    arXiv:2604.15602v1 Announce Type: new Abstract: Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work ex…

    0.9#cs.cl
  79. arXiv cs.CL/arxiv.org/
    Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

    arXiv:2604.15706v1 Announce Type: new Abstract: Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using b…

    0.7#cs.cl
  80. arXiv cs.CL/arxiv.org/
    Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies

    arXiv:2604.15607v1 Announce Type: new Abstract: AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 s…

    0.7#cs.cl#cs.ai