Evaluating Paid vs Local LLMs for QA Test Design: A Practical Comparison

Many test engineers find themselves restricted from using popular AI services like ChatGPT or Gemini due to corporate security policies or budget limitations. In this article, we are going to explore local LLMs, compare them with paid models such as GPT-4.1, GPT-5.2, and GPT-4o, and examine how test engineers can benefit from using local LLMs.

If you’re a developer, a QA engineer, or anyone facing a long list of tickets, you know the feeling. You’re looking at a blank screen with a ton of work ahead, maybe it’s test cases or documentation, and you realise ChatGPT could handle it in seconds.

It’s tempting to take the easy way, isn’t it? You know the tool exists and how to use it, but then you hit a wall, because the company’s security rules stop you from sharing internal data on public clouds. Or maybe you’re a startup and the finance team just saw the GPT-4 API bill and shut it down. It’s a common problem in today’s offices: everyone wants the benefits of AI, but privacy rules or budgets get in the way.

That brings us to the main point of this article: a comparison of paid models, such as GPT-4 and GPT-5.2, with free, locally hosted open-source models that you can run on your own hardware. We’ll focus on how they perform in quality assurance and test design. AI-powered testing is especially useful for web and mobile apps because they need thorough testing on many devices and browsers to deliver a great user experience. The main question is whether a locally hosted model on an average computer can really compete with the paid options.

We ran these experiments on a machine that is realistic for a standard corporate environment: an Intel Xeon CPU running at 2 GHz and 32 GB of RAM, with no GPU or other dedicated graphics hardware, a configuration most companies can easily provision.

Key takeaways
  • Ollama works best for developers who need automation and API integration. LM Studio is better for those who want to explore interactively and experiment with prompts. Both run models locally to keep your data private.
  • Paid models like GPT-4.1 and GPT-5.2 usually deliver better quality, speed, and follow rules more closely than free local models. Still, local LLMs are a good option when cloud services aren’t allowed due to security rules or budget limits.
  • For QA tasks, 7B parameter models strike the best balance between resource requirements and performance.

How to choose your local LLM?

If you want to run large language models on your own computer without sending data to the cloud, there are two main options to consider: Ollama and LM Studio. Both let you use LLMs locally, which is helpful if your company blocks cloud services like ChatGPT or when budgets are tight. However, each tool is designed for a different kind of user.

Ollama

Ollama is a local LLM runtime mainly aimed at developers.

It runs models in the background and exposes a REST API that works much like the OpenAI API, just fully local. This makes it a good fit for automation, integrations, agents, MCP servers, and CI/CD pipelines. It also handles concurrent requests better, which matters if multiple tools or scripts need to use the model at the same time.

A simple way to think about Ollama is: “Docker for LLMs.” It also provides a graphical user interface, but it is very simple and limited in functionality.
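
To make the “OpenAI-like, but local” point concrete, here is a minimal sketch of calling Ollama’s REST API from Node.js/TypeScript. The model name and prompt are placeholders rather than the exact ones from our experiment; Ollama listens on port 11434 by default.

```typescript
// Minimal sketch: send a prompt to a locally running Ollama model (Node 18+, global fetch).
// "mistral:7b" is a placeholder - use any model you have pulled with `ollama pull`.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "mistral:7b",
    prompt: "Generate 10 test cases for the checkout user story below: ...",
    stream: false, // return a single JSON object instead of a token stream
  }),
});

const data = await res.json();
console.log(data.response); // the generated text
```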

LM Studio

LM Studio is a desktop application with a graphical interface.

It focuses on ease of use. You can browse available models, download them, and start chatting right away in a chat-style UI. It’s great for trying out prompts, experimenting with different models, and doing demos. LM Studio also exposes a REST API, but automation and backend usage are not its main strengths. This makes LM Studio more suitable for individual or private use rather than large-scale automation.

A good mental model here is: “ChatGPT, but running locally.”
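
Because LM Studio’s local server speaks the OpenAI-compatible API dialect, the official openai npm client can be pointed at it. A minimal sketch, assuming LM Studio’s default local-server port (1234) and a hypothetical model identifier:

```typescript
// Minimal sketch: reuse the openai npm client against LM Studio's local server.
// No real API key is needed locally; the model name must match a model loaded in LM Studio.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio", // any non-empty string works for the local server
});

const completion = await client.chat.completions.create({
  model: "qwen2.5-7b-instruct", // hypothetical identifier
  messages: [{ role: "user", content: "Generate 10 checklist items for the checkout flow." }],
});

console.log(completion.choices[0].message.content);
```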

Which one to pick?

  • Ollama is the better choice if you're a developer who needs automation, integrations, or programmatic access.
  • LM Studio is best if you want to explore models, learn how they behave, or tune prompts interactively.

Both tools run models locally, keep your data private, and support popular open-source LLMs. The difference is not in what models they run, but in how you plan to use them.

AI for software testing: approach and results of ELEKS’ experiment

For the experiments, we used the same prompt to generate 10 test cases and a separate prompt to generate 10 checklists. Below you will find a detailed user story, the exact prompts used, and sample outputs for each run.

To measure time-related metrics, we used a custom script that calls both the OpenAI API and the Ollama API running locally on a virtual machine. This allowed us to compare response times under the same conditions.
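
The full script is not reproduced here, but the timing logic was roughly equivalent to the sketch below: send the same prompt to a cloud endpoint and to the local Ollama endpoint, and record wall-clock time. Model names and the prompt are illustrative.

```typescript
// Simplified sketch of the timing harness (TypeScript, Node 18+).
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function timed(label: string, call: () => Promise<string>): Promise<void> {
  const start = Date.now();
  const output = await call();
  console.log(`${label}: ${((Date.now() - start) / 1000).toFixed(1)}s, ${output.length} chars`);
}

const prompt = "Role: you are an experienced QA/Test Engineer. ..."; // full prompt shown below

await timed("GPT-4.1", async () => {
  const res = await openai.chat.completions.create({
    model: "gpt-4.1",
    messages: [{ role: "user", content: prompt }],
  });
  return res.choices[0].message.content ?? "";
});

await timed("Mistral 7B (Ollama)", async () => {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "mistral:7b", prompt, stream: false }),
  });
  return (await res.json()).response as string;
});
```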

Results of running prompts (test design) on different models

The table below summarises the captured metrics. We measured execution time using a generated script (TypeScript + OpenAI API, Ollama API). To calculate the quality and cost scores, we reviewed the outputs, focusing on the quality of checks and how constraints were handled. Additionally, we uploaded the generated test cases and checklists to GPT-5.2 for a more comprehensive review. The final assessment combines our analysis with that model’s evaluation.

| Metric | GPT-4.1 (paid) | GPT-4o (paid) | GPT-5.2 (paid) | Llama 3.2 (local) | Qwen2.5 7B (local) | Mistral 7B (local) | DeepSeek-Coder 6.7B Instruct (local) |
|---|---|---|---|---|---|---|---|
| Parameters / size | No open data, approx. 500-5000B+ | No open data, approx. 500-5000B+ | No open data, approx. 500-5000B+ | 3B, 3.4 GB | 7B, 4.7 GB | 7B, 4.4 GB | ~7B, 3.8 GB |
| Generation of 10 test cases | 19.4 s | 15.7 s | 12.1 s | 86 s | 168 s | 153 s | 60 s** |
| Cost for tokens (550 out / 1300 in) | $0.006 | $0.008 | $0.009 | free | free | free | free |
| Speed | Slower | Medium | Fast | Fast | Slower | Slower | Fast |
| Generation of 10 checklists | 6.7 s | 9.6 s | 3.9 s | 21 s | 42 s | 54 s | 55 s |
| Cost for tokens (550 in / 250 out) | $0.004 | $0.006 | $0.008 | free | free | free | free |
| Quality of output (1-10) | 9 | 8* | 9 | 5 | 6 | 7 | 4 |

* Some tests are too generic, steps are underspecified, and there are logic duplications in the checklist.
** It was fast, but the quality of the output was worse than expected, as constraints were skipped.

Translating user stories into testable requirements: online store purchase

Let’s start with the user story. Imagine that as a registered customer, you want to purchase a product from the online store, so that you can receive the selected item at your address.

In this business context, the online store allows customers to browse products, add them to their cart, and complete purchases with different payment options. This user story covers the end-to-end checkout flow for a single product.

Key assumptions

Before defining test coverage, we establish several baseline conditions:

  • User is registered and logged in
  • Product is available in stock
  • Delivery service is available for the user’s address
  • The payment gateway is operational

Acceptance criteria (AC)

We structured our acceptance criteria using the Given-When-Then format for clarity. This structure enables comprehensive test coverage by making each criterion independently testable.

AC1: Given the user is logged in, when the user adds a product to the cart, then the product is displayed in the cart with the correct price and quantity.

AC2: Given the cart contains a product, when the user proceeds to checkout, then the checkout page is displayed with the order summary.

AC3: Given that the checkout page is open, when the user enters a valid delivery address, then the address is accepted and saved.

AC4: Given that a valid delivery address is provided, when the user selects a payment method and confirms payment, then the payment is processed successfully.

AC5: Given that the payment is successful, when the order is completed, then an order confirmation page is displayed with the order number.

AC6: Given that the order is completed, the user receives an order confirmation email.

Scope boundaries

To maintain focused test coverage, we excluded:

  • Guest checkout
  • Multiple product purchase
  • Order cancellation and refunds
  • Promotions and discount codes

Testability notes (for test automation)

  • Each acceptance criterion can be mapped to at least one test automation scenario (see the sketch after this list)
  • Checkout flow is suitable for API, UI, and end-to-end automation
  • Stable test data required for products, users, and payment methods
  • Mocks or stubs are recommended for the payment gateway in lower environments
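
As an illustration of the first note above, here is a minimal sketch of AC1 translated into a UI automation scenario. It assumes a Playwright-based suite; the base URL and all selectors are hypothetical and would need to match the real application.

```typescript
// Hypothetical Playwright scenario for AC1: logged-in user adds an in-stock product to the cart.
import { test, expect } from "@playwright/test";

test("AC1: product added to cart with correct price and quantity", async ({ page }) => {
  // Assumes an authenticated storage state is configured for this project.
  await page.goto("https://shop.example.com/products"); // hypothetical URL
  await page.getByTestId("product-card").first().click();
  await page.getByRole("button", { name: "Add to Cart" }).click();
  await page.goto("https://shop.example.com/cart");

  // Oracle from AC1: the product appears with the correct price and default quantity 1.
  await expect(page.getByTestId("cart-item")).toBeVisible();
  await expect(page.getByTestId("cart-item-quantity")).toHaveText("1");
  await expect(page.getByTestId("cart-item-price")).not.toBeEmpty();
});
```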

Potential test types

  • UI test automation (Add to cart, checkout, confirmation)
  • API test automation (Cart, Order, Payment endpoints)
  • Integration testing (Payment and email services)
  • Regression testing
  • Smoke testing

Prompts

Prompt 10TCs

“Role: you are an experienced QA/Test Engineer. Given the following requirement(s), generate comprehensive 10 test cases that cover positive, negative, boundary, and edge cases. For each test case, include: Test Case ID, Description, Preconditions, Steps, Expected Result, and Actual Result (to be filled in during execution). Follow Acceptance Criteria. The number of test steps depends on how detailed the requirements are; if the requirements are not clear, steps should be high level. Requirements:”

Prompt 10 checklists

“Role: you are an experienced QA/Test Engineer. Generate 10 Test cases, including positive and negative for the given requirement at the end. The number of test cases should depend on the complexity of a requirement. Do not add description; provide just numbered test cases. Each test should start from 'Verify'. Requirements:”

Examples of generated test cases and checklists

1. GPT 4.1 

Prompt 10TCs results

Test case 1: add product to cart (positive test for AC1)  

Test case ID: TC_US001_01 

Description: Verify that a logged-in user can add an in-stock product to the cart, and it is displayed at the correct price and quantity. 

Steps: 

  1. Navigate to the product listing page.  
  2. Select a product.  
  3. Click "Add to Cart".  
  4. View the cart.  

Expected Result: the product is present in the cart with the correct price and quantity (default quantity 1).  

Actual result: to be filled during execution 

Prompt 10 checklists results

Verify that a logged-in user can successfully add an available product to the cart, and the product appears in the cart with the correct price and quantity.  

Verify that, after adding a product to the cart, proceeding to checkout displays the checkout page with the correct order summary.  

Verify that entering a valid delivery address on the checkout page is accepted and saved.  

Verify that providing a valid delivery address and selecting a valid payment method allows the user to complete the payment successfully.  

Verify that after successful payment, an order confirmation page is displayed with a unique order number.  

Verify that after order completion, an order confirmation email is sent to the user with the correct order details.  

Verify that adding a product to the cart does not work if the product is out of stock (negative).  

Verify that entering an invalid or incomplete delivery address during checkout is rejected and appropriate error messages are displayed (negative).  

Verify that payment cannot be completed if invalid payment details are entered and an appropriate error message is shown (negative).  

Verify that the user cannot access the checkout or complete a purchase without being logged in (negative).  

2. GPT 4o 
3. GPT 5.2
4. LLama 3.2 
5. Qwen2.5 7b
6. Mistral 7b-instruct 
7. DeepSeek-coder 6.7b-instruct 

Challenges with local LLMs

ChatGPT typically prioritises explicit numeric constraints. For example, when instructed to “create 10 test cases,” it reliably generates exactly ten.

Local LLMs, however, often prioritise semantic reasoning over hard constraints. For instance, if the prompt says “generate 10 test cases” but also includes “the number of test cases should depend on the complexity of the requirement,” a local LLM may treat the numeric constraint as a soft guideline and produce 8–12 test cases instead. Even when we remove that second constraint and ask for a small fixed number, say two test cases, the model still generates more than requested. This illustrates how local LLMs behave in contrast to online paid services.

Additionally, even when explicitly instructed not to use a table format, some models (e.g., DeepSeek) still ignore this constraint and return the output in a table.
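
One pragmatic way to live with this behaviour is to make the calling script the enforcer: count what the model returned and re-prompt when the number is wrong. The sketch below assumes an Ollama endpoint and a “Test Case ID:” line in every generated case; both are assumptions about your setup, not part of the original experiment.

```typescript
// Sketch: enforce "exactly N test cases" around a local model that treats the number as a
// soft constraint. The counting heuristic is tied to an assumed "Test Case ID:" line format.
async function generate(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "qwen2.5:7b", prompt, stream: false }),
  });
  return (await res.json()).response as string;
}

const countTestCases = (text: string): number =>
  (text.match(/Test Case ID:/gi) ?? []).length;

async function generateExactly(prompt: string, expected: number, maxRetries = 3): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const output = await generate(prompt);
    if (countTestCases(output) === expected) return output;
    prompt += `\nIMPORTANT: return exactly ${expected} test cases, no more and no fewer.`;
  }
  throw new Error(`Model did not produce exactly ${expected} test cases after ${maxRetries} attempts`);
}
```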

Model-based results assessment

In short, the paid ChatGPT/OpenAI API models (GPT-4.1, GPT-4o, and GPT-5.2) delivered the best overall results, which is expected. These models are estimated to have hundreds of billions of parameters (500–1000B+), compared with the 3–7B-parameter local LLMs used in this evaluation.

For clarity, it makes sense to evaluate paid and free models separately.

Among the paid models, GPT-4.1 and GPT-5.2 demonstrated the best quality-to-cost ratio, with GPT-5.2 being the fastest in terms of response time while maintaining high output quality.

Among the free/local models, Qwen 2.5 7B and Mistral 7B produced the highest-quality artifacts. Although they are slower than LLaMA 3.2, the overall artifact quality is noticeably better, especially for structured QA outputs such as test cases and checklists.

| Model | Overall | Main strengths | Main problems to fix |
|---|---|---|---|
| DeepSeek-Coder 6.7B | 4/10 | Has an AC-mapped table; includes some negative coverage | Checklist contains unclear items ("empty product"), scope creep ("account dashboard"), and duplicates ACs without real verifiable checks; does not apply the "do not use table form" constraint |
| LLaMA 3.2 | 5/10 | Many cases; includes boundary ideas | Misaligned/irrelevant boundaries (min/max price), weak link to ACs; checklist includes items not in the story ("order history", "services stable") |
| Mistral 7B | 7/10 | Good AC coverage plus some sensible negatives (invalid address, unsupported payment) | A few negatives conflict with the assumptions unless explicitly marked as resilience tests; some steps are too generic ("complete checkout process") |
| Qwen2.5 7B | 6/10 | Reasonable positives and negatives (empty cart, invalid quantity, no payment method) | Some items are not in the requirements ("no product selected before checkout"), some redundancy in the checklist, and missing explicit email/order-number validation details |
| GPT-4.1 | 9/10 | Best-balanced set: E2E, boundary, clear negatives | Still missing some "oracles" (what exactly to check in the order summary/email); a couple of extras beyond the ACs (login negative) should be labelled as non-AC |
| GPT-4o | 8/10 | Solid coverage; one strong edge case (race for the last item) | Some expected results are a bit generic (no explicit order-number format/persistence) |
| GPT-5.2 | 9/10 | Clean mapping to AC1–AC6 plus a good negative set (empty cart, invalid address, payment failure, stock boundary) | A couple of items are vague for test automation ("observe the page"); mixes assumptions vs negative resiliency without labelling |

Comparing responses: ChatGPT vs local LLMs

| Aspect | ChatGPT | Local LLM | Comments |
|---|---|---|---|
| Numeric constraints | Strong | Weak | ChatGPT implicitly prioritises the explicit numeric constraint; local LLMs often prioritise semantic reasoning over hard constraints. |
| Instruction conflict resolution | Good | Poor | Many local LLMs do not enforce numeric constraints strictly, especially when the context is long or the task is "analytical" (QA, testing, reasoning). |
| Abstraction ability | High | Medium/Low | Local LLMs often lack strong abstraction/compression ability. |
| QA completeness bias | Balanced | Over-complete | Local LLMs treat over-producing test cases as safer than under-producing, so even a direct instruction such as "generate exactly 10 (or 2) test cases" is handled as a soft constraint and the model decides the final number itself. |

Which local model to choose for QA and activities like test design?

Each model is characterised by one very important metric: the number of parameters.

LLM parameters are the internal numeric weights of a large language model that store what the model has learned from data.

In simple terms, parameters are the “memory” of an LLM.

Each parameter is a number that influences how the model:

  • understands words and context
  • connects concepts
  • predicts the next token

So, for these experiments, we chose 7B models because they hit a sweet spot: small enough to deploy easily on machines with no GPU and limited resources (16–32 GB of memory is enough), yet still capable of good reasoning.

Larger models (13B+) offer better reasoning, but they become slow on CPU and require much more RAM (20–40 GB).
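
A rough back-of-the-envelope calculation (our rule of thumb, not a hard specification) explains the file sizes in the table above: a quantised model needs roughly parameters × bits-per-weight / 8 bytes of memory, plus headroom for the context window.

```typescript
// Approximate memory footprint of a quantised local model (rule of thumb, not a spec).
const approxModelGb = (paramsBillions: number, bitsPerWeight: number): number =>
  (paramsBillions * bitsPerWeight) / 8;

console.log(approxModelGb(7, 5));   // ≈ 4.4 GB - in line with the ~4.4-4.7 GB 7B files above
console.log(approxModelGb(13, 16)); // ≈ 26 GB at full precision - why 13B+ models want far more RAM
```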

If you have access to powerful machines with GPU support, using local LLMs is absolutely a viable and often preferable option. With sufficient hardware, local models can deliver strong results while preserving data privacy and avoiding cloud dependency.

However, for everyday and relatively simple tasks, such as generating test cases, checklists, or basic test documentation, small LLMs up to 8B parameters can already do a very good job and significantly speed up QA-related work.

If you have the option to use online services such as ChatGPT/OpenAI or Gemini (or similar platforms), they are generally recommended. These models are trained on vast amounts of data and typically provide higher precision, better instruction adherence, and more consistent outputs, especially for structured artifacts.

When a local LLM is the only option, it is advisable, together with IT support, to allocate the most powerful machine available. Based on the available hardware, you should then decide:

  • which local LLM runtime to use (e.g., Ollama or LM Studio), and
  • which specific model best fits your priorities.

At the time of this experiment, Qwen 2.5 7B and Mistral 7B delivered good-quality results, but they required almost twice the execution time compared to faster models such as LLaMA 3.2.

This ultimately becomes a trade-off decision. If speed is the top priority, faster models like LLaMA are a better choice. If output quality is more important, slightly slower models may be justified.

Conclusions

The choice between paid and local LLMs ultimately depends on your specific constraints and priorities. The key is understanding the trade-offs: faster models sacrifice some quality, while higher-quality local models require more processing time.

It is also important to note that models evolve rapidly. Before making a final decision, always review the latest benchmarks and evaluations for the models you are considering. Public benchmark platforms can provide up-to-date insights into quality, speed, and resource consumption, helping you make a more informed choice.

AI changes how software testing is done by looking at business risks, user impact, and past failures to highlight the most important features and workflows. This helps teams focus their testing where it matters most, boosting both efficiency and coverage. By using historical data and business priorities, AI can point out which parts need the most attention. Still, many complex AI models make decisions without clear explanations, which makes debugging and trusting their results hard. That’s why human oversight is crucial to keep the testing process reliable and high-quality.


FAQs

How is AI used in quality assurance?

AI test automation tools and modern frameworks make it easier to manage and run large test suites. Generative AI can create test data automatically, which helps scale and adapt testing. AI also improves QA by generating test cases, predicting where defects might occur using code and telemetry, creating test data, and supporting self-healing UI testing and visual checks to spot regressions faster. Plus, it offers analytics that point out flaky tests, coverage gaps, and possible root causes, so teams can focus on what matters most.

