Over a year ago, I released Terminal Commander, a tool that uses LLMs to perform terminal operations. But that version was very rudimentary, lacking true interactive control over the command line. Since then, I’ve redesigned the entire system so the AI can automate the terminal with little to no human intervention. Let’s take a look at the changes and what you can expect from the tool.
⚠️ Warning ⚠️: Giving an LLM full access to your terminal may result in data loss or system corruption.
Structure
The underlying structure of Terminal Commander remains the same: run one command at a time and try to fix any errors. With the latest update, however, the program provides significantly more context. On each run, the model is given the OS information, the contents of the current directory, previously run commands, notes from earlier runs, the last twenty lines of terminal output, and, optionally, web search results.
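As a rough illustration, here is a minimal sketch of how that per-run context might be gathered. The function and field names are my own assumptions, not Terminal Commander’s actual code:

```python
import os
import platform

def build_context(terminal_output, history, notes, web_results=None, tail_lines=20):
    """Collect the information handed to the model on each run (hypothetical layout)."""
    return {
        "os": platform.platform(),              # OS information
        "cwd": os.getcwd(),                     # where we currently are
        "dir_contents": os.listdir("."),        # current directory contents
        "previous_commands": history,           # commands run so far
        "notes": notes,                         # notes the model left for itself
        "terminal_tail": terminal_output[-tail_lines:],  # last ~20 lines of output
        "web_search": web_results,              # optional search results
    }
```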
Given that information, the LLM has two output options: a command to send to the terminal, or a special command for auxiliary functions such as requesting more than the last 20 lines of output. It can also write down notes to be carried into the next run.
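The exact response format isn’t covered here, but conceptually the loop has to separate ordinary shell commands from special commands and carry the notes forward. A hypothetical parser, assuming the model is asked to reply in JSON, might look like this (the special-command names are guesses based on the ones mentioned in this post):

```python
import json

# Assumed special-command names, based on the ones mentioned in this article
SPECIAL_COMMANDS = {"more_output", "password", "error"}

def parse_response(raw):
    """Split a model reply into (kind, value, notes). Illustrative only."""
    reply = json.loads(raw)
    notes = reply.get("notes", "")                 # carried over to the next run
    if reply.get("special") in SPECIAL_COMMANDS:
        return "special", reply["special"], notes  # auxiliary function
    return "command", reply.get("command", ""), notes  # send to the terminal
```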
Unlike before, Terminal Commander now provides an interactive terminal session using libtmux. This means the AI can perform actions exactly like a human user, giving it the ability to interface with programs that prompt for further input.
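Here is a minimal example of the kind of libtmux session this relies on. It isn’t Terminal Commander’s actual code, just the basic mechanics of driving a pane programmatically:

```python
import time
import libtmux

# Start (or replace) a detached tmux session we can drive programmatically
server = libtmux.Server()
session = server.new_session(session_name="commander-demo", kill_session=True, attach=False)
pane = session.windows[0].panes[0]

# Type a command into the live terminal, exactly as a user would
pane.send_keys("python3 -i", enter=True)
time.sleep(1)

# Interactive programs are handled the same way: just keep sending keys
pane.send_keys("print(2 + 2)", enter=True)
time.sleep(1)

# Read back what is currently visible in the pane
print("\n".join(pane.capture_pane()))
```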
Testing
⚠️ Warning ⚠️: These tests are not meant to be scientific. They are here as a proof of concept to show that the tool works.
To test the tool, I used five different models: Cohere’s command-a-03-2025, Mistral Small 3.2, Gemma 3 27b, Qwen3 30b a3b, and, for fun, Qwen3 8b. Command A was the largest model, but the others claimed to be “smarter”.
The models were tested on four different tasks. The environment was a standard Ubuntu 24.04 installation, and the temperature was set to 0.3. Each model was given three tries to complete each task.
(1) Create a folder called testdir with a file called test.txt inside.
This was a simple task to see if the model understood the prompt.
(2) Run interactivetest.py.
interactivetest.py is a program used to test the LLM’s ability to interface with interactive programs, as well as its willingness to use the password special command (a sketch of what such a script might look like follows this task list).
(3) Install docker.
This was bundled with a web search to see if the LLM could correctly parse and extract the information needed to complete the task.
(4) Find test150.txt and put “found” inside it.
This was used to test the AI’s reaction to a non-existent file. A correct response was to use the error special command after looking around for a bit.
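For reference, here is a rough stand-in for what an interactive test script like the one in task (2) could look like. The real interactivetest.py isn’t reproduced in this post, so treat this purely as an illustration:

```python
# interactivetest.py (illustrative stand-in, not the original script)
import getpass

name = input("What should I call you? ")
answer = input(f"{name}, do you want to continue? [y/N] ")
if answer.strip().lower() != "y":
    raise SystemExit("Aborted by user")

# A password prompt is what forces the model to use the password special command
secret = getpass.getpass("Enter the password to finish: ")
print("Test complete" if secret else "No password given")
```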
Results

Command A
Command A performed well on all of the tests. Since it wasn’t a thinking model and it ran on Cohere’s hardware, output times were fast. There was an issue during the first run of “Install docker.” where it thought the password was the hostname, but on the second run, it remembered to use the password special command.

Mistral Small 3.2
Mistral Small did well on interactivetest.py and the basic task. However, when it came to installing docker, it struggled to enter the correct commands and began hallucinating. It also failed the non-existent file task twice, going into an infinite loop of running find commands.

Gemma 3 27b
Gemma 3 refused to use the password special command during interactivetest.py and the docker installation process. Instead, it just sent the literal string “password” and eventually looped infinitely when it couldn’t figure it out. When run as root, it was able to complete the docker installation after two tries, but on the non-existent file test, it looped infinitely again.

Qwen3 30b a3b
Qwen3 30b was unable to understand the terminal output. It constantly re-ran commands it had already executed and failed every single test. I don’t know why this happened, but I suspect it has something to do with Qwen3 30b being an MoE model. If anyone knows what’s happening or manages to get it working, leave a comment!
Qwen3 8b
Qwen3 8b showed shockingly good results for its size. It passed every single test, crushing its bigger cousin, Qwen3 30b. The only hiccup was the first run of interactivetest.py, where it got stuck in an infinite loop waiting for user input instead of supplying that input itself. On the second run, it realized that it was the “user” and should answer the program’s questions.
Conclusion
With the right models, Terminal Commander performs well, demonstrating the potential of AI-driven terminal automation. The standout performer, Qwen3 8b, proved that even smaller models can be used for this task. While challenges like hallucinations and infinite loops remain, it’s still a huge leap forward. In the future, Terminal Commander could be better integrated as a tool for the LLM to call, enabling seamless workflows in development and system administration.
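As a rough idea of what that integration could look like, here is a hypothetical tool definition in the OpenAI function-calling style. The name and parameters are my own assumptions, not an existing Terminal Commander API:

```python
# Hypothetical function-calling definition exposing Terminal Commander to a chat model
terminal_commander_tool = {
    "type": "function",
    "function": {
        "name": "run_terminal_task",
        "description": (
            "Hand a natural-language task to Terminal Commander, which runs "
            "commands in an interactive tmux session until the task is done."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "task": {
                    "type": "string",
                    "description": "What to accomplish, e.g. 'Install docker.'",
                },
                "allow_web_search": {"type": "boolean", "default": False},
            },
            "required": ["task"],
        },
    },
}
```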

