Using AI to migrate years of Enzyme Unit Tests to React Testing Library
Migrating hundreds of unit tests from Enzyme to React Testing Library (RTL) was a task I could no longer avoid. Enzyme, once the de facto choice for testing React applications, has been dead for a long time, clinging on through unofficial React 17 support and the large swath of tests it left behind. Enzyme will never work with React 18 or any future version, and with React 19 just around the corner, we were eager to pay down this debt.
Our codebase is complex, with hundreds of non-trivial tests written over many years. While we had chipped away at migrating tests to RTL piecemeal, we needed to rip off the bandaid and make the complete transition to keep our codebase up-to-date and healthy.
The problem? This seemed like a daunting, labor-intensive task that could tie up limited resources for months. To make matters worse, RTL and Enzyme follow fundamentally different philosophies, so the migration required a deeper understanding than just adjusting syntax.
The solution was to use AI to automate and streamline the migration process. Over three weeks, with less than a week of actual effort, I migrated hundreds of complex tests using an LLM, some Python scripts, and incremental fine-tuning.
Other engineering teams have also faced this problem. This year, The New York Times Engineering team wrote about their migration to React 18, which first required the conversion of all their Enzyme tests. In their words, “this might have been the biggest piece of the entire project,” and was accomplished over multiple months. Slack Engineering also wrote about their process recently and how they used ASTs and LLMs to convert 15,000 tests.
Although the rest of this article discusses the conversion of Enzyme to RTL, it is not meant to be a deep dive. In fact, the problem of converting Enzyme to RTL is the least interesting thing here. What is exciting is how LLMs and a small amount of automation can streamline repetitive tasks in a way that wouldn’t have been possible before. This isn’t new information; the power of AI has been discussed ad nauseam. However, I hope to show that with an LLM and a little bit of glue, it’s easy to go from testing an idea to applying it at scale.
Key Challenges
Transitioning from Enzyme presented several challenges:
- Resource constraints: With limited resources, we couldn’t afford to dedicate significant time to a manual migration effort.
- Philosophical differences: RTL and Enzyme follow fundamentally different philosophies — beyond just syntax — which means adapting tests to RTL requires more than straightforward code transformations.
- Complex and lengthy tests: Many tests spanned hundreds of lines, making them time-consuming to convert manually.
- Heavy use of shallow rendering: Our tests relied heavily on Enzyme’s shallow rendering, whereas RTL renders the full component tree and runs the component lifecycle. This means more code executes in the test environment, which can require extra work to handle (e.g. mocking).
The Plan
From the outset, I wanted to see if I could use LLMs to streamline the conversion. I had tried this in the past, using GPT-3.5 Turbo, with mixed results. The model would struggle with details and ignore seemingly clear instructions, especially when parsing longer code blocks. This could have been PEBKAC, though — I have a lot more experience building LLM applications and fine-tuning prompts than I did 1.5 years ago.
Confident that GPT-4o would perform significantly better, I validated this by using a starting prompt in ChatGPT to convert a few sample tests, and the results looked promising. The model followed the detailed instructions much more consistently than GPT-3.5.
I was also curious if fine-tuning the base model would improve its output. This comes with a catch, though. Fine-tuning typically requires an extensive training set to be effective, and I only had a handful of manually converted tests done as part of earlier efforts. While I had no way to prove the effectiveness of a fine-tuned model ahead of time, I saw it as a good learning opportunity as long as I could automate the fine-tuning in a way that required little effort. I used the existing examples to create a fine-tuned model (using the OpenAI dashboard to initiate the job) and compared it side-by-side with the base model. As expected, the limited dataset did not have an impact on the output, positive or negative.
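I kicked off that first job through the OpenAI dashboard, but the same step is easy to script. Here is a minimal sketch using the OpenAI Python SDK (v1+); the file name and base model snapshot are placeholders rather than the exact values from the project:

```python
# Sketch: create a fine-tuning job from a JSONL training file.
# Assumes the openai Python SDK (v1+) and OPENAI_API_KEY in the environment;
# the file name and base model snapshot are placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the training data (one chat-formatted example per line)
training_file = client.files.create(
    file=open("fine_tuning_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job against the base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id, job.status)
```

The incremental jobs later in the project follow the same shape, just with a progressively larger training file.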
With these initial tests out of the way, I felt confident that an LLM-based workflow would be worthwhile as long as it was lightweight, automated, and supported an iterative way of working.
Tooling
I created several Python scripts to automate the process. These are all simple scripts executed via the command line.
- analyze_files.py: The first step was understanding the scope of the work. To do this, I wrote a script to analyze the source code and output a list of Enzyme specs ordered by the number of tests in each file. The script uses an Abstract Syntax Tree (AST) to inspect the files accurately (a simplified sketch follows this list).
- convert_spec_files.py: This script uses LangChain to make LLM calls to OpenAI. The calls include a system and user prompt instructing the LLM to convert the test from Enzyme to RTL. The converted file is saved in place for manual review and updates. The script accepts a list of file paths as input and converts them in parallel (sketched below).
- update_training_data.py: Uses two git hashes to find modified spec files, copying the “before” and “after” versions to the training data directory. This makes updating the training data a trivial task.
- generate_fine_tuning_data.py: This script generates a JSONL file for fine-tuning the model. Each line is a training example that shows how the LLM should respond to a given input. The script combines the system and user prompts with the training data to produce the output (the format is shown below).
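To make the tooling more concrete, here is a simplified sketch of the ordering step. The real analyze_files.py walks an AST so the counts are exact; the stand-in below just pattern-matches it/test calls and filters for Enzyme imports, and the search path is a placeholder:

```python
# Simplified sketch of analyze_files.py: rank Enzyme spec files by test count.
# The real script uses an AST for accuracy; this version pattern-matches
# it(...)/test(...) calls instead. The search path is a placeholder.
import re
from pathlib import Path

TEST_CALL = re.compile(r"^\s*(?:it|test)(?:\.each\([^)]*\))?\s*\(", re.MULTILINE)
ENZYME_IMPORT = re.compile(r"from\s+['\"]enzyme['\"]")

def count_tests(source: str) -> int:
    return len(TEST_CALL.findall(source))

def main() -> None:
    results = []
    for spec in Path("src").rglob("*.spec.*"):
        source = spec.read_text(encoding="utf-8")
        if ENZYME_IMPORT.search(source):
            results.append((count_tests(source), str(spec)))
    # Smallest files first: quick wins and early training data
    for count, path in sorted(results):
        print(f"{count:4d}  {path}")

if __name__ == "__main__":
    main()
```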
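The conversion script itself is mostly prompt plumbing around a LangChain call. A trimmed-down sketch, where the model name and prompt text are simplified placeholders:

```python
# Trimmed-down sketch of convert_spec_files.py: send each Enzyme spec to the
# LLM and overwrite the file with the converted output for manual review.
# MODEL_NAME and the prompt text are simplified placeholders.
import sys
from pathlib import Path

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

MODEL_NAME = "gpt-4o"  # or the fine-tuned model id

SYSTEM_PROMPT = "You are an expert at migrating Enzyme tests to React Testing Library."
USER_PROMPT = (
    "Convert the following Enzyme test file to React Testing Library. "
    "Return only the converted code.\n\n{spec_source}"
)

prompt = ChatPromptTemplate.from_messages(
    [("system", SYSTEM_PROMPT), ("user", USER_PROMPT)]
)
llm = ChatOpenAI(model=MODEL_NAME, temperature=0)
chain = prompt | llm

def convert(paths: list[str]) -> None:
    inputs = [{"spec_source": Path(p).read_text()} for p in paths]
    # batch() runs the requests concurrently
    for path, result in zip(paths, chain.batch(inputs)):
        Path(path).write_text(result.content)
        print(f"converted {path}")

if __name__ == "__main__":
    convert(sys.argv[1:])
```

Saving the output in place keeps the review simple: the conversion shows up as an ordinary diff in the editor and in git.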
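Each training example pairs the same system and user prompts with the hand-reviewed RTL version as the assistant’s answer, one JSON object per line in OpenAI’s chat fine-tuning format. A sketch of generate_fine_tuning_data.py, with placeholder directory names:

```python
# Sketch of generate_fine_tuning_data.py: turn before/after spec pairs into a
# JSONL file in OpenAI's chat fine-tuning format. Directory names are placeholders.
import json
from pathlib import Path

TRAINING_DIR = Path("training_data")  # contains before/ and after/ copies
OUTPUT_FILE = Path("fine_tuning_data.jsonl")

SYSTEM_PROMPT = "You are an expert at migrating Enzyme tests to React Testing Library."
USER_PROMPT = (
    "Convert the following Enzyme test file to React Testing Library. "
    "Return only the converted code.\n\n{spec_source}"
)

def build_example(before: str, after: str) -> dict:
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT.format(spec_source=before)},
            {"role": "assistant", "content": after},
        ]
    }

def main() -> None:
    with OUTPUT_FILE.open("w", encoding="utf-8") as out:
        for before_path in sorted((TRAINING_DIR / "before").glob("*")):
            after_path = TRAINING_DIR / "after" / before_path.name
            example = build_example(before_path.read_text(), after_path.read_text())
            out.write(json.dumps(example) + "\n")

if __name__ == "__main__":
    main()
```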
Workflow and Process
The migration used the following workflow:
- Batch processing: Tests were migrated in small batches (for example, ten at a time) to keep the review process manageable (convert_spec_files.py). I started with the specs containing the smallest number of tests. These were faster to convert and allowed me to build up training data more quickly.
- Iterative review: Each converted test was manually reviewed and updated to pass. The amount of work required to finalize a test varied widely: some needed no changes, others required adding data-testid props to components, and a few required reworking the approach entirely due to RTL’s different philosophy.
- Incremental fine-tuning: Migrated tests were copied into the training data directory (update_training_data.py) so that a new training file could be generated (generate_fine_tuning_data.py). This file was then used to incrementally fine-tune the base model, improving performance over time. A sketch of the update step follows this list.
- Prompt updates: As new patterns emerged, I refined the user prompt and added some few-shot examples to provide additional context to the LLM.
- Merging: Batches of migrated tests were periodically merged into the main branch.
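The data-collection half of the incremental fine-tuning loop is just a diff between two commits. A rough sketch of update_training_data.py, assuming the before/after layout from the earlier sketches and using plain git commands via subprocess:

```python
# Sketch of update_training_data.py: given two git hashes, copy the Enzyme
# version and the reviewed RTL version of each changed spec file into the
# training data directory. Paths and the spec-file filter are placeholders.
import subprocess
import sys
from pathlib import Path

TRAINING_DIR = Path("training_data")

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    ).stdout

def main(before_hash: str, after_hash: str) -> None:
    for sub in ("before", "after"):
        (TRAINING_DIR / sub).mkdir(parents=True, exist_ok=True)
    changed = git("diff", "--name-only", before_hash, after_hash).splitlines()
    for path in changed:
        if ".spec." not in path:
            continue
        name = Path(path).name
        # "git show <hash>:<path>" prints the file as it existed at that commit
        (TRAINING_DIR / "before" / name).write_text(git("show", f"{before_hash}:{path}"))
        (TRAINING_DIR / "after" / name).write_text(git("show", f"{after_hash}:{path}"))
        print(f"added {name}")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

After reviewing and merging a batch, running this against the pre- and post-batch hashes tops up the training set before the next fine-tuning run.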
Results
I completed the migration over a 3-week period, working part-time. It helped that I was familiar with a good deal of the codebase, but there were many components I had never touched, and those took some time to understand. The workflow described above reduced the rote work significantly, allowing me to focus on the difficult problems and to ensure the migrated tests didn’t just pass but were actually useful.
A manual migration would have taken significantly longer, likely 2–3 months of work, and would have required other projects to be put on hold. In addition to the time savings, the migration was incredibly cheap: the total API and training fees were less than $25.
Final Thoughts
- The LLM efficiently handled repetitive tasks like updating imports, adjusting syntax, and replacing Enzyme’s simulate with RTL’s fireEvent, freeing me to focus on fixing edge cases.
- While the base model handled conversions well, the fine-tuned model excelled with async tests and maintaining stylistic consistency. A fine-tuned model isn’t strictly necessary if your prompts and examples are well written, but fine-tuning did provide better output towards the end of the project.
- Unsurprisingly, the LLM struggled with migrating concepts that went against RTL’s philosophy, for example, modifying an instance’s state or testing props. For the most part, these required manual updates to reconfigure the test. However, including some few-shot examples in the prompt did help, especially with prop checking.
The small amount of effort required to implement the workflow was well worth the investment. Keeping things simple to start and working in batches allowed me to iteratively improve the process. By starting with the easiest tests first, I was able to make a lot of progress quickly (a big motivation boost) and incorporate learning (and data) back into the workflow. When I got to the most complex tests, everything was working well, and the process was significantly easier than it would have otherwise been.
Want to learn more about testing? Take a look at my step-by-step guide on TDD with React, Jest and React Testing Library.