Artificial intelligence has automated parts of scientific work for years, but not the full research cycle. The AI Scientist is an AI system that generates research ideas, runs experiments, writes papers and reviews the results. It was developed for machine learning research and tested in two modes: one using human-provided code templates and one using a template-free, agentic search process. In an external test, three generated papers were submitted to a workshop at ICLR 2025. One scored above the workshop’s average acceptance threshold, although all AI-generated submissions were withdrawn after peer review under a pre-established protocol.

How the System Operates

The AI Scientist works through four phases. It first builds an archive of research directions and hypotheses within a chosen machine learning subfield. For each direction, it generates a title, explains the idea and sets out an experimental plan. It then checks novelty using literature-search tools, including the Semantic Scholar API and web access, discarding ideas that too closely resemble existing work before further development.
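The novelty filter in that first phase can be pictured as a simple screen over literature-search hits. A minimal sketch in Python, assuming a crude title-overlap heuristic in place of the system's actual LLM judgement over Semantic Scholar results (the function names and the threshold here are hypothetical):

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two titles."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def is_novel(idea_title: str, found_titles: list[str], threshold: float = 0.5) -> bool:
    """Keep an idea only if no retrieved title overlaps it too strongly."""
    return all(token_overlap(idea_title, t) < threshold for t in found_titles)
```

An idea whose title nearly duplicates a retrieved paper would be discarded before any experiments are run.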

The next phase executes experiments and prepares results for later writing. In the template-based setting, the system starts from code that reproduces a training run of a popular algorithm and then follows the proposed plan in linear order. In the template-free setting, it writes an initial script itself and refines it through an agentic tree search that spends extra test-time compute across four stages: preliminary investigation, hyperparameter tuning, research agenda execution and ablation studies. After each experiment, it records notes in the style of an experimental journal to support later planning and drafting.
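The staged search can be sketched as a greedy loop over those four stages. The stage names come from the article, while the `expand` and `score` callables stand in for the LLM-driven script edits and experiment evaluations, so everything else here is a hypothetical simplification:

```python
STAGES = ["preliminary investigation", "hyperparameter tuning",
          "research agenda execution", "ablation studies"]

def tree_search(root_script, expand, score, width=3):
    """Keep the best-scoring script variant at each stage, journalling as it goes."""
    best, journal = root_script, []
    for stage in STAGES:
        candidates = expand(best, stage, width)   # e.g. LLM-edited scripts
        best = max(candidates, key=score)         # greedy: follow the best node
        journal.append((stage, score(best)))      # experimental-journal notes
    return best, journal
```

A real tree search would keep several branches alive per stage rather than a single greedy path; the journal entries correspond to the notes the system records after each experiment.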

It then writes a machine learning conference manuscript section by section in LaTeX, using its notes and plots. To build the related work section and insert citations, it queries Semantic Scholar for relevant literature and compares those results against the generated manuscript over 20 rounds. For each potential citation, it produces a textual justification to guide how that reference should be used. The completed manuscript then moves into automated review.
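The citation pass can be sketched as repeated search-and-justify rounds. In this hypothetical outline, `search` stands in for the Semantic Scholar query and `justify` for the model's textual justification of each candidate reference:

```python
def build_citations(manuscript, search, justify, rounds=20):
    """Each round retrieves candidate papers and keeps those the model can justify."""
    citations = []
    for _ in range(rounds):
        for paper in search(manuscript):
            note = justify(paper, manuscript)     # justification text, or None
            if note and paper not in dict(citations):
                citations.append((paper, note))
    return citations
```

The justification attached to each kept reference guides how it is used in the related work section.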

Review Results and External Submission

The Automated Reviewer produces numerical scores for soundness, presentation, contribution, overall quality and reviewer confidence. It also lists strengths and weaknesses and issues an accept-or-reject decision. Its pipeline uses an ensemble of five reviews followed by a meta-review in which the model acts as an area chair and makes a final decision based on all five reviews.
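The five-review ensemble can be illustrated with a toy aggregation rule. In the actual pipeline an LLM acting as area chair writes the meta-review, so the plain averaging and the threshold below are hypothetical stand-ins:

```python
from statistics import mean

def meta_review(reviews, accept_threshold=6.0):
    """Fold five per-review overall scores into a single decision."""
    overall = [r["overall"] for r in reviews]
    avg = mean(overall)
    return {"mean_overall": round(avg, 2),
            "decision": "accept" if avg >= accept_threshold else "reject"}
```

Ensembling five reviews before the meta-decision reduces the variance of any single model-generated review, mirroring how an area chair weighs several human reviews.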

Its decisions were compared with publicly available ICLR data and with the human reviewer consistency measured in the NeurIPS 2021 consistency experiment. Agreement between automated and human assessments, measured by F1 score and balanced accuracy, was reported as comparable with inter-human agreement. On papers dated before the model's likely knowledge cutoff, balanced accuracy reached 69%; on papers from 2025, beyond the likely cutoff, it was 66%. The higher pre-cutoff figure was presented as a sign that some data contamination may have inflated performance before the cutoff, although post-cutoff performance remained comparable with that of human reviewers.
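Balanced accuracy, the headline metric here, is the mean of recall on true accepts and recall on true rejects, which keeps the skewed accept-to-reject ratio of conference decisions from dominating the score. A minimal illustration with made-up confusion counts:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on accepts) and specificity (recall on rejects)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Hypothetical counts: 60 of 80 true accepts and 78 of 100 true rejects recovered.
print(balanced_accuracy(tp=60, fn=20, tn=78, fp=22))  # 0.765
```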

The generated papers were also assessed across a range of language models used as the core model within the system. Quality increased as the underlying models improved over time. More compute per paper also improved quality. The chart on page 5 shows rising paper scores as the number of experimental nodes increases, indicating better results with deeper test-time search budgets. With recent model generations, average output approached borderline acceptability for machine learning conference workshops.

Peer Review Outcomes and Constraints

A stronger test came through submission of three generated manuscripts to the peer-review process of the I Can’t Believe It’s Not Better workshop at ICLR 2025. The process took place with institutional review board approval and with the cooperation of ICLR leadership and workshop organisers. The template-free version of the system was prompted with the workshop’s broad theme and then used to generate ideas, experiments and papers. Human filtering selected the most promising outputs for submission on the basis of topic alignment, correct code execution and correct manuscript formatting. Ideation, coding and manuscript writing for the selected submissions proceeded without human modification. Reviewers were told that some submissions were AI-generated, but not which ones.

One of the three manuscripts received reviewer scores of 6, 7 and 6, an average of 6.33, placing it above the workshop’s average acceptance threshold. The organisers indicated that it would in all likelihood have been accepted had it not been withdrawn because it was AI-generated. That manuscript reported a negative result, matching the workshop’s focus on interesting negative results. The other two papers did not reach the acceptance bar. Internal review by human AI researchers concluded that one submission met workshop level, but none met the higher threshold required for a main ICLR conference paper.
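The quoted average is simply the arithmetic mean of the three reviewer scores:

```python
from statistics import mean

scores = [6, 7, 6]
print(round(mean(scores), 2))  # 6.33
```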

Limits remained clear. Only one of the three workshop submissions cleared the threshold, and workshops have a far higher acceptance rate than main conferences: 70% for the ICLR 2025 ICBINB workshop against 32% for the ICLR 2025 main conference. Common failure modes included underdeveloped ideas, incorrect implementation of the main idea, lack of deep methodological rigour, experimental errors, duplicated figures and hallucinations such as inaccurate citations. At present, the system is limited to computational experiments.

The reported results showed that one automated system could complete a multi-step machine learning research workflow, generate full manuscripts and produce one workshop submission that scored above an external acceptance threshold. The same results also showed clear limits. Only one of three submitted papers reached that level, none met the standard for a main ICLR conference publication and several weaknesses remained across ideation, implementation and reporting. Broader concerns include pressure on peer review, inflation of research credentials, reuse of others’ ideas without proper credit, elimination of scientist jobs and unethical or dangerous experiments.

Source: Nature

Image Credit: iStock


References:

Lu Ch, Lu C, Lange RT et al. (2026) Towards end-to-end automation of AI research. Nature; 651: 914–919.
