
Hello, I'm Nathanaël Beau.
Ph.D. researcher with a focus on code generation and neural network architectures, blending academic research with practical engineering approaches in AI and NLP.
About me
I am a recent Ph.D. graduate with a solid foundation in both academic and industry environments, bridging technical complexity with strategic vision. With a background in engineering, I approach problem-solving with a pragmatic and inquisitive mindset, translating complex issues into actionable roadmaps and sustainable solutions.
My research centers on advancing code generation techniques, from model development to evaluation metrics. During my Ph.D., titled "Code Generation from Natural Language Descriptions", I designed innovative neural architectures to ensure syntactic code validity while leveraging insights from both linguistics and computing. I have authored multiple publications in top-tier NLP conferences and developed a Retrieval-Augmented Generation (RAG) model that outperforms leading solutions like Mistral and Copilot on code-assist datasets. Additionally, I contributed to the creation of a new dataset for fine-tuning, evaluating, and enhancing NL-to-Python code generation with comprehensive unit test coverage. I am eager to continue exploring areas such as LLM memorization, alternative training strategies for long-term objective alignment, and scalable model solutions.
Projects
Code Insight Dataset
Introduces a dataset of 3,409 examples for code generation, featuring clarified intents, code snippets, and related unit tests. Covers libraries such as Pandas, NumPy, and Regex, refined for reduced data contamination. Evaluated on models like Mistral 7B, CodeLLaMa 13B, Starcoder 15B, and GPT-4 to highlight model strengths and weaknesses in coding tasks.
- code generation
- Python
- NLP
- Dataset

BertranX
I propose an architecture for semantic parsing that translates English into Python code snippets while ensuring syntactic validity. We evaluate its strengths and weaknesses on two development-aid datasets, Django and CoNaLa. We provide an implementation of the transition system used to construct Abstract Syntax Trees, which are then deterministically transformed into valid Python code.
- Code generation
- Semantic parsing
- Python
- NLP

grammarBERT
grammarBERT is a BERT-based encoder model specifically designed to generate Abstract Syntax Trees (ASTs) from derivation sequences of programming languages. By leveraging BERT's transformer architecture, grammarBERT captures the syntactic and semantic intricacies of code, enabling a deep understanding of programming structures.
- BERT
- Semantic parsing
- AST
- Code generation

LM for planning
Study of how pre-training affects the exploration dynamics of language models in reinforcement learning tasks. Focusing on a basic arithmetic task, we propose a modification to the KL divergence penalty that more effectively balances exploration and proximity to the pre-trained model, improving the model’s ability to optimize long-term goals.
- LM
- Planning
- RL

Articles
CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow
Published in 2024
Location: Paris, France
Authors: Nathanaël Beau and Benoît Crabbé
The impact of lexical and grammatical processing on generating code from natural language
Published in 2022
Location: Paris, France
Authors: Nathanaël Beau and Benoît Crabbé
Focus area
- Code Generation
- Semantic Parsing
- Low-Resource Environment
- LLM with RL
- Grokking
My experience
PhD Candidate in NLP - Onepoint
Paris, France
Conducted doctoral research on 'Generating Python code from a Natural Language description' as part of the CIFRE program. Developed and trained the RETROcode model, a RAG model for code assistance, achieving competitive results with top LLMs such as Codex and Mistral. Created the BertranX model for syntactic validity in code generation and led a team to develop the CodeInsight dataset. Conducted recurrent training on large language models for clients, including Chanel.
Oct. 2020 - Sept. 2023
R&D Data Scientist - Onepoint
Paris Area, France
Preparation for CIFRE thesis on NLP in collaboration with Université de Paris, supervised by Mr. Crabbé and Mr. Lemberger. Developed state-of-the-art models and compiled research on Python code generation from natural language descriptions, presenting scientific articles monthly to onepoint collaborators.
Jan. 2020 - Oct. 2020
Data Science Advisor - GMS Consulting
Rabat, Morocco
Designed and implemented a one-week Data Science training for the Moroccan Ministry of Public Transport. The program was tailored for non-experts, focusing on data science fundamentals and applications in public transport.
Nov. 2019 - Jan. 2020
Data Scientist Intern - Weave
Paris Area, France
Performed research in NLP using BERT for FAQ management and implemented a conversational agent using RASA. Compared methods for efficiently retrieving answers from a question-answer database.
March 2019 - Aug. 2019
Intern Developer - EI-Technologies
Paris Area, France
Developed automated XML schema translation between MOSS SAS and THALES using XSLT. Created a customer service interface with Angular for GrasSavoye, implementing a ChatBot. Collaborated within a team of six.
June 2018 - Sept. 2018