If you’re a developer, a QA engineer, or anyone facing a long list of tickets, you know the feeling. You’re looking at a blank screen with a ton of work ahead, maybe it’s test cases or documentation, and you realise ChatGPT could handle it in seconds.
It’s tempting to take the easy way, isn’t it? You know the tool exists and how to use it, but then you hit a wall, because the company’s security rules stop you from sharing internal data on public clouds. Or maybe you’re a startup and the finance team just saw the GPT-4 API bill and shut it down. It’s a common problem in today’s offices: everyone wants the benefits of AI, but privacy rules or budgets get in the way.
That brings us to the main point of this article: a comparison of paid models, such as GPT-4 and GPT-5.2, with free, locally hosted open-source models that you can run on your own hardware. We’ll focus on how they perform in quality assurance and test design. AI-powered testing is especially useful for web and mobile apps because they need thorough testing on many devices and browsers to deliver a great user experience. The main question is whether a locally hosted model on an average computer can really compete with the paid options.
We ran these experiments on a machine that reflects a standard corporate environment: an Intel Xeon CPU running at 2 GHz and 32 GB of RAM, with no GPU or other advanced graphics hardware. This is a configuration most companies can easily provision.
- Ollama works best for developers who need automation and API integration. LM Studio is better for those who want to explore interactively and experiment with prompts. Both run models locally to keep your data private.
- Paid models like GPT-4.1 and GPT-5.2 usually deliver better quality and speed, and follow instructions more closely, than free local models. Still, local LLMs are a good option when cloud services aren’t allowed due to security rules or budget limits.
- For QA tasks, 7B parameter models strike the best balance between resource requirements and performance.
How to choose your local LLM?
If you want to run large language models on your own computer without sending data to the cloud, there are two main options to consider: Ollama and LM Studio. Both let you use LLMs locally, which is helpful if your company blocks cloud services like ChatGPT or when budgets are tight. However, each tool is designed for a different kind of user.
Ollama
Ollama is a local LLM runtime mainly aimed at developers.
It runs models in the background and exposes a REST API that works much like the OpenAI API, just fully local. This makes it a good fit for automation, integrations, agents, MCP servers, and CI/CD pipelines. It also handles concurrent requests better, which matters if multiple tools or scripts need to use the model at the same time.
A simple way to think about Ollama is: “Docker for LLMs.” It also provides a graphical user interface, but the GUI is very simple and limited in functionality.
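For illustration, here is a minimal sketch of calling Ollama from TypeScript. It assumes Ollama is running on its default port (11434) and that a model such as `mistral:7b` has already been pulled; the model name is a placeholder for whatever you have installed.

```typescript
// Minimal sketch: call Ollama's local /api/generate endpoint.
// Assumes Ollama runs on the default port 11434 and that the
// "mistral:7b" model has been pulled (`ollama pull mistral:7b`).
async function generate(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "mistral:7b",
      prompt,
      stream: false, // return one JSON object instead of a token stream
    }),
  });
  const data = await res.json();
  return data.response; // the generated text
}

generate("Generate 10 test cases for a login form.").then(console.log);
```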
LM Studio
LM Studio is a desktop application with a graphical interface.
It focuses on ease of use. You can browse available models, download them, and start chatting right away in a chat-style UI. It’s great for trying out prompts, experimenting with different models, and doing demos. LM Studio also exposes a REST API, but automation and backend usage are not its main strengths. This makes LM Studio more suitable for individual or private use rather than large-scale automation.
A good mental model here is: “ChatGPT, but running locally.”
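Since LM Studio’s local server speaks an OpenAI-compatible API, the official openai client can simply be pointed at it. A minimal sketch, assuming the local server is running on LM Studio’s default port (1234) and the model name matches one you have loaded:

```typescript
// Minimal sketch: LM Studio's local server exposes an OpenAI-compatible API.
// Assumes the server is started on the default port 1234 with a model loaded.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio", // any non-empty string; the local server ignores it
});

const completion = await client.chat.completions.create({
  model: "qwen2.5-7b-instruct", // placeholder: use whichever model is loaded
  messages: [{ role: "user", content: "Generate a checklist for login testing." }],
});
console.log(completion.choices[0].message.content);
```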
Which one to pick?
- Ollama is the better choice if you're a developer who needs automation, integrations, or programmatic access.
- LM Studio is best if you want to explore models, learn how they behave, or tune prompts interactively.
Both tools run models locally, keep your data private, and support popular open-source LLMs. The difference is not in what models they run, but in how you plan to use them.
AI for software testing: approach and results of ELEKS’ experiment
For the experiments, we used the same prompt to generate 10 test cases and a separate prompt to generate 10 checklists. Below you will find a detailed user story, the exact prompts used, and sample outputs for each run.
To measure time-related metrics, we used a custom script that calls both the OpenAI API and the Ollama API running locally on a virtual machine. This allowed us to compare response times under the same conditions.
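The exact script is not reproduced here, but a minimal sketch of such a timing harness could look like the following; the model names and prompt are placeholders, and an `OPENAI_API_KEY` environment variable is assumed for the cloud side.

```typescript
// Minimal sketch of a timing harness comparing response times of the
// OpenAI API and a local Ollama server under identical prompts.
// Model names and the prompt are placeholders; OPENAI_API_KEY must be set.
import OpenAI from "openai";

const prompt = "Generate 10 test cases for the online store checkout user story.";

async function timeOpenAI(model: string): Promise<number> {
  const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
  const start = performance.now();
  await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return performance.now() - start;
}

async function timeOllama(model: string): Promise<number> {
  const start = performance.now();
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  await res.json(); // wait for the full (non-streamed) response
  return performance.now() - start;
}

console.log(`gpt-4o:     ${(await timeOpenAI("gpt-4o")).toFixed(0)} ms`);
console.log(`mistral:7b: ${(await timeOllama("mistral:7b")).toFixed(0)} ms`);
```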
Results of running prompts (test design) on different models
The table below summarises the captured metrics. We measured execution time with a generated script (TypeScript, calling the OpenAI API and the Ollama API). To calculate the quality score, we reviewed the outputs, focusing on the quality of the checks and how constraints were handled. Additionally, we uploaded the AI-generated test cases and checklists to GPT-5.2 for a more comprehensive review. The final assessment combines our analysis with that model’s evaluation.
| Metric | GPT-4.1 | GPT-4o | GPT-5.2 | Llama 3.2 | Qwen2.5 7B | Mistral 7B | DeepSeek-Coder 6.7B Instruct |
|---|---|---|---|---|---|---|---|
| Type | Online, paid | Online, paid | Online, paid | Local, free | Local, free | Local, free | Local, free |
| Parameters / size | No open data; est. 500–5000B+ | No open data; est. 500–5000B+ | No open data; est. 500–5000B+ | 3B, 3.4 GB | 7B, 4.7 GB | 7B, 4.4 GB | ~7B, 3.8 GB |
| Generation of 10 TCs | 19.4 s | 15.7 s | 12.1 s | 86 s | 168 s | 153 s | 60 s** |
| Cost for tokens (1300 in / 550 out) | $0.006 | $0.008 | $0.009 | Free | Free | Free | Free |
| Speed | Slower | Medium | Fast | Fast | Slower | Slower | Fast |
| Generation of 10 checklists | 6.7 s | 9.6 s | 3.9 s | 21 s | 42 s | 54 s | 55 s |
| Cost for tokens (550 in / 250 out) | $0.004 | $0.006 | $0.008 | Free | Free | Free | Free |
| Quality of output (1–10) | 9 | 8* | 9 | 5 | 6 | 7 | 4 |
\* Some tests are too generic, steps are underspecified, and there are logic duplications in the checklist.
\*\* Fast, but the output quality was worse than expected because constraints were skipped.
Translating user stories into testable requirements: online store purchase
Let’s start with the user story. Imagine that as a registered customer, you want to purchase a product from the online store, so that you can receive the selected item at your address.
In this business context, the online store allows customers to browse products, add them to their cart, and complete purchases with different payment options. This user story covers the end-to-end checkout flow for a single product.
Key assumptions
Before defining test coverage, we establish several baseline conditions:
- User is registered and logged in
- Product is available in stock
- Delivery service is available for the user’s address
- The payment gateway is operational
Acceptance criteria (AC)
We structured our acceptance criteria using the Given-When-Then format for clarity. This structure enables comprehensive test coverage by making each criterion independently testable.
AC1: Given the user is logged in, when the user adds a product to the cart, then the product is displayed in the cart with the correct price and quantity.
AC2: Given the cart contains a product, when the user proceeds to checkout, then the checkout page is displayed with the order summary.
AC3: Given that the checkout page is open, when the user enters a valid delivery address, then the address is accepted and saved.
AC4: Given that a valid delivery address is provided, when the user selects a payment method and confirms payment, then the payment is processed successfully.
AC5: Given that the payment is successful, when the order is completed, then an order confirmation page is displayed with the order number.
AC6: Given that the order is completed, then the user receives an order confirmation email.
Scope boundaries
To maintain focused test coverage, we excluded:
- Guest checkout
- Multiple product purchase
- Order cancellation and refunds
- Promotions and discount codes
Testability notes (for test automation)
- Each acceptance criterion can be mapped to at least one test automation scenario (see the sketch after this list)
- Checkout flow is suitable for API, UI, and end-to-end automation
- Stable test data required for products, users, and payment methods
- Mocks or stubs are recommended for the payment gateway in lower environments
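To make the mapping concrete, below is a minimal Playwright sketch that turns AC1 into a UI automation scenario. The base URL, product name, price, and test IDs are hypothetical placeholders, not taken from a real application.

```typescript
// Minimal Playwright sketch mapping AC1 to a UI automation scenario.
// The base URL, product name, price, and test IDs are hypothetical placeholders.
import { test, expect } from "@playwright/test";

test("AC1: logged-in user adds a product to the cart", async ({ page }) => {
  // Precondition: an authenticated session (e.g., via a stored storageState).
  await page.goto("https://shop.example.com/products");

  // Open a product page and add the product to the cart.
  await page.getByRole("link", { name: "Sample Product" }).click();
  await page.getByRole("button", { name: "Add to Cart" }).click();

  // Verify the cart shows the product with the correct price and default quantity.
  await page.goto("https://shop.example.com/cart");
  await expect(page.getByTestId("cart-item-name")).toHaveText("Sample Product");
  await expect(page.getByTestId("cart-item-price")).toHaveText("$19.99");
  await expect(page.getByTestId("cart-item-qty")).toHaveText("1");
});
```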
Potential test types
- UI test automation (Add to cart, checkout, confirmation)
- API test automation (Cart, Order, Payment endpoints)
- Integration testing (Payment and email services)
- Regression testing
- Smoke testing
Prompts
Prompt 10TCs
“Role: you are an experienced QA/Test Engineer. Given the following requirement(s), generate comprehensive 10 test cases that cover positive, negative, boundary, and edge cases. For each test case, include: Test Case ID, Description, Preconditions, Steps, Expected Result, and Actual Result (to be filled in during execution). Follow Acceptance Criteria. The number of test steps depends on how detailed the requirements are; if the requirements are not clear, steps should be high level. Requirements:”
Prompt 10 checklists
“Role: you are an experienced QA/Test Engineer. Generate 10 Test cases, including positive and negative for the given requirement at the end. The number of test cases should depend on the complexity of a requirement. Do not add description; provide just numbered test cases. Each test should start from 'Verify'. Requirements:”
Examples of generated test cases and checklists
Prompt 10TCs results
Test case 1: add product to cart (positive test for AC1)
Test case ID: TC_US001_01
Description: Verify that a logged-in user can add an in-stock product to the cart, and it is displayed at the correct price and quantity.
Steps:
- Navigate to the product listing page.
- Select a product.
- Click "Add to Cart".
- View the cart.
Expected Result: the product is present in the cart with the correct price and quantity (default quantity 1).
Actual result: to be filled during execution
Prompt 10 checklists results
1. Verify that a logged-in user can successfully add an available product to the cart, and the product appears in the cart with the correct price and quantity.
2. Verify that, after adding a product to the cart, proceeding to checkout displays the checkout page with the correct order summary.
3. Verify that entering a valid delivery address on the checkout page is accepted and saved.
4. Verify that providing a valid delivery address and selecting a valid payment method allows the user to complete the payment successfully.
5. Verify that after successful payment, an order confirmation page is displayed with a unique order number.
6. Verify that after order completion, an order confirmation email is sent to the user with the correct order details.
7. Verify that adding a product to the cart does not work if the product is out of stock (negative).
8. Verify that entering an invalid or incomplete delivery address during checkout is rejected and appropriate error messages are displayed (negative).
9. Verify that payment cannot be completed if invalid payment details are entered and an appropriate error message is shown (negative).
10. Verify that the user cannot access the checkout or complete a purchase without being logged in (negative).
Prompt 10 checklists results
1. Verify that a registered user can add a product to the cart and that the cart displays the correct price and quantity.
2. Verify that the checkout page displays an order summary when a user proceeds to checkout with items in the cart.
3. Verify that entering a valid delivery address on the checkout page results in the address being accepted and saved.
4. Verify that selecting a valid payment method and confirming payment result in successful payment processing.
5. Verify that after a successful payment, the order confirmation page displays with an order number.
6. Verify that the user receives an order confirmation email after completing the order.
7. Verify that adding a product with zero stock results in an error preventing the product from being added to the cart.
8. Verify that entering an invalid delivery address format is rejected on the checkout page.
9. Verify that failing to select a payment method triggers an error message prompting the user to choose one.
10. Verify that a simulation of a payment gateway failure results in an appropriate error message and does not allow order completion.
Prompt 10TCs results
Test case ID: TC-US001-01
Description: Verify that a logged-in user can add a product to the cart and see the correct price and quantity.
Preconditions:
- User is registered and logged in
- Product is available in stock
Steps:
- Navigate to the product listing page
- Select a product
- Click “Add to Cart”
- Open the cart
Expected result:
- The selected product is displayed in the cart
- Price and quantity match the product details
Actual result: To be filled during execution
Prompt 10 checklists results
1. Verify logged-in user can add an in-stock product to the cart and the cart displays the correct product name, price, and quantity.
2. Verify cart does not allow adding a product when product stock is unavailable.
3. Verify user can proceed to checkout from cart and checkout page displays correct order summary.
4. Verify checkout page is not accessible when cart is empty.
5. Verify system accepts and saves a valid delivery address on checkout page.
6. Verify system rejects invalid or incomplete delivery address and displays appropriate error message.
7. Verify user can successfully complete payment using a valid payment method and payment is processed.
8. Verify payment fails and order is not completed when an invalid or declined payment method is used.
9. Verify the order confirmation page is displayed with a unique order number after successful payment.
10. Verify order confirmation email is sent to the registered user’s email address after order completion.
Prompt 10TCs results
Test case ID: TC001
Description: Positive test case for adding a product to the cart
Preconditions: User is logged in, product is available
Steps:
- Log in with valid credentials,
- Browse products, select the desired product and add to cart,
- Proceed to checkout.
Expected result: Product is displayed in cart with correct price and quantity.
Actual result: to be filled in during execution
Prompt 10 checklists results
1. Verify that the user is logged in successfully.
2. Verify that a valid product is added to the cart with the correct price and quantity.
3. Verify that the checkout page is displayed when adding a product to the cart.
4. Verify that the delivery address is accepted and saved when entered correctly.
5. Verify that payment processing is successful using a valid payment method.
6. Verify that the order completion status is confirmed after payment processing.
7. Verify that an order confirmation email is sent successfully.
8. Verify that the order number is displayed on the confirmation page.
9. Verify that the user can view their order history with the correct details.
10. Verify that the payment gateway and email services are stable and functional.
Prompt 10TCs results
Test case ID: TC01
Description: Verify that a logged-in user can add a product to the cart.
Preconditions: User is registered and logged in; Product is available in stock.
Steps:
- Log in to the online store with valid credentials.
- Browse through the product catalogue and select a product.
- Click "Add to Cart" for the selected product.
- Verify that the product displays correctly in the cart, including price and quantity.
Expected result: the product should be added to the cart with the correct price and quantity displayed.
Actual result: to be filled in during execution
Prompt 10 checklists results
1. Verify the user can add a product to the cart.
2. Verify that adding multiple products to the cart works as expected.
3. Verify the correct price is displayed for a single added product.
4. Verify that an error message is shown if no product is selected before proceeding to checkout.
5. Verify the checkout page is displayed after clicking "Proceed to Checkout".
6. Verify the order summary on the checkout page matches the cart contents.
7. Verify that an invalid address format results in an error.
8. Verify that a successful payment redirects to the order confirmation page.
9. Verify the order confirmation page displays the correct order details.
10. Verify that the user receives an email with the order details after completing the purchase.
Prompt 10TCs results
Test case ID: TC-001-AddProductToCart
Description: Verify that the product is added to the cart correctly
Preconditions: User is logged in, product is available in stock
Steps:
- Log in as a registered user
- Navigate to the product page
- Click the 'Add to Cart' button for the selected product
Expected result: the product should be added to the cart with the correct price and quantity
Actual result: to be filled during execution
Prompt 10 checklists results
1. Verify that a registered user can add a product to the cart, and it displays correctly with price and quantity.
2. Verify that when a registered user proceeds to checkout from the cart, the checkout page is displayed with the order summary.
3. Verify that when a valid delivery address is entered during checkout, it is accepted and saved.
4. Verify successful payment processing when a payment method is selected and confirmed during checkout.
5. Verify that upon successful payment, an order confirmation page is displayed with the correct order number.
6. Verify that an order confirmation email is sent to the user's registered email address.
7. Verify negative scenario: Checkout flow when there are no products in the cart.
8. Verify negative scenario: Checkout flow for a product that is out of stock.
9. Verify negative scenario: Checkout flow when the delivery service is not available for the user’s address.
10. Verify negative scenario: Checkout flow when the payment gateway is not operational.
Prompt 10 checklists results
1. Verify AC1: The user is logged in when the product is added to the cart, the product should be displayed with correct price and quantity.
2. Verify AC1: The user can add multiple quantities of a single product to the cart without issue.
3. Verify AC2: The checkout page appears when the user proceeds to checkout after adding products to the cart.
4. Verify AC3: A valid delivery address is accepted and saved upon entering on the checkout page.
5. Verify AC3: An error message or alert displays if an invalid delivery address is entered.
6. Verify AC4: The user can select a payment method, confirm payment without issue when providing a valid delivery address.
7. Verify AC5: Order confirmation page appears upon successful payment with order number displayed on the page.
8. Verify AC6: An order confirmation email is received for successfully placed orders.
9. Verify AC1 and AC2: Ensure the products added to cart are reflected accurately in the user's account dashboard.
10. Verify negative scenario of entering an empty product to the cart. The system should prevent such actions without errors or issues.
Challenges with local LLMs
ChatGPT typically prioritises explicit numeric constraints. For example, when instructed to “create 10 test cases,” it reliably generates exactly ten.
Local LLMs, however, often prioritise semantic reasoning over hard constraints. For instance, if the prompt says “generate 10 test cases” but also includes “the number of test cases should depend on the complexity of the requirement”, a local LLM may treat the numeric constraint as a soft guideline and produce 8–12 test cases instead. Even when we removed the second constraint and asked for a fixed number, such as 2, the model still generated more than 2 test cases. This illustrates how local LLMs behave in contrast to online paid services.
Additionally, even when explicitly instructed not to use a table format, some models (e.g., DeepSeek) still ignore this constraint and return the output in a table.
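One practical mitigation is to validate the output programmatically and re-prompt on violation. The sketch below, which assumes an Ollama server on the default port and uses a placeholder model name, counts the returned checklist items and retries until the requested number is met:

```typescript
// Minimal sketch: enforce "exactly N checklist items" by counting the
// items in the response and re-prompting when the count is off.
// Assumes an Ollama server on the default port; the model name is a placeholder.
async function generate(model: string, prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  return (await res.json()).response;
}

async function generateExactly(model: string, prompt: string, expected: number): Promise<string> {
  for (let attempt = 0; attempt < 3; attempt++) {
    const output = await generate(model, prompt);
    // Checklist items are expected to start with "Verify", optionally numbered.
    const count = output
      .split("\n")
      .filter((line) => /^\s*(\d+[.)]\s*)?Verify/.test(line)).length;
    if (count === expected) return output;
    prompt += `\n\nYour previous answer had ${count} items. Return exactly ${expected} items, nothing else.`;
  }
  throw new Error(`Model did not return exactly ${expected} items after 3 attempts.`);
}
```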
Model-based results assessment
In short, the paid ChatGPT/OpenAI API models (GPT-4.1, GPT-4o, and GPT-5.2) delivered the best overall results, which is expected. These models are estimated to have hundreds of billions of parameters (500–1000B+), compared with the 7B-parameter local LLMs used in this evaluation.
For clarity, it makes sense to evaluate paid and free models separately.
Among the paid models, GPT-4.1 and GPT-5.2 demonstrated the best quality-to-cost ratio, with GPT-5.2 being the fastest in terms of response time while maintaining high output quality.
When analysing artifacts generated by free/local models, Qwen 2.5 7B and Mistral 7B produced the highest-quality outputs. Although they are slower than LLaMA 3.2, the overall artifact quality is noticeably better, especially for structured QA outputs such as test cases and checklists.
| Model | Overall | Main strengths | Main problems to fix |
|---|---|---|---|
| DeepSeek-Coder 6.7B | 4/10 | Has an AC-mapped table; includes some negative coverage | Checklist contains unclear items ("empty product"), scope creep ("account dashboard"), and duplicates ACs without real verifiable checks; also ignores the "do not use table format" constraint |
| LLaMA 3.2 | 5/10 | Many cases; includes boundary ideas | Misaligned/irrelevant boundaries (min/max price), weak link to ACs, checklist includes items not in story ("order history", "services stable") |
| Mistral 7B | 7/10 | Good AC coverage + some sensible negatives (invalid address, unsupported payment) | A few negatives conflict with "Assumptions" unless explicitly marked as resilience tests; some steps are too generic ("complete checkout process") |
| Qwen2.5 7B | 6/10 | Reasonable positives + negatives (empty cart, invalid qty, no payment method) | Some items are not in requirements ("no product selected before checkout"), some redundancy in the checklist, and missing explicit email/order-number validation details |
| GPT-4.1 | 9/10 | Best-balanced set: E2E, boundary, clear negatives | Still missing some "oracles" (what exactly in order summary/email), and a couple extras beyond ACs (login-negative) should be labelled as non-AC |
| GPT-4o | 8/10 | Solid coverage, one strong edge case (race for last item) | Some expected results are a bit generic (no explicit "order number format/persistence") |
| GPT-5.2 | 9/10 | Clean mapping to AC1–AC6 + good negative set (empty cart, invalid address, payment failure, stock boundary) | A couple of items are vague for test automation ("observe the page"), and it mixes assumptions vs negative resiliency without labelling |
Comparison of responses: ChatGPT vs local LLMs
| Aspect | ChatGPT | Local LLM | Comments |
|---|---|---|---|
| Numeric constraints | Strong | Weak | ChatGPT implicitly prioritises the explicit numeric constraint. Local LLMs often prioritise semantic reasoning over hard constraints. |
| Instruction conflict resolution | Good | Poor | Many local LLMs don't enforce numeric constraints strictly, especially when the context is long or the task is analytical (QA, testing, reasoning). |
| Abstraction ability | High | Medium/Low | Local LLMs often lack strong abstraction/compression ability. |
| QA completeness bias | Balanced | Over-complete | Local LLMs treat over-producing test cases as safer than under-producing, so even a direct instruction such as "generate exactly 10 (or 2) test cases" is handled as a soft constraint, and the model decides the final count on its own. |
Which local model to choose for QA and activities like test design?
Each model is characterised by one especially important metric: the number of parameters.
LLM parameters are the internal numeric weights of a large language model that store what the model has learned from data.
In simple terms, parameters are the “memory” of an LLM.
Each parameter is a number that influences how the model:
- understands words and context
- connects concepts
- predicts the next token
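In slightly more formal terms, a parameter is an entry in one of the model’s learned weight matrices. As a simplified view, the next-token distribution is produced by applying a learned projection $W$ to the model’s hidden state $h_t$:

$$P(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(W h_t)$$

Every attention and feed-forward layer contributes more such matrices; a “7B” model simply has about seven billion of these learned numbers in total.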
For these experiments, we chose 7B models. They are a sweet spot among small LLMs: easy to deploy on machines with no GPU or limited resources (16–32 GB of RAM is enough) while still offering good reasoning.
Larger models (13B+) reason better, but they become slow on CPU and require a lot of RAM (20–40 GB).
If you have access to powerful machines with GPU support, using local LLMs is absolutely a viable and often preferable option. With sufficient hardware, local models can deliver strong results while preserving data privacy and avoiding cloud dependency.
However, for everyday and relatively simple tasks, such as generating test cases, checklists, or basic test documentation, small LLMs up to 8B parameters can already do a very good job and significantly speed up QA-related work.
If you have the option to use online services such as ChatGPT/OpenAI or Gemini (or similar platforms), they are generally recommended. These models are trained on vast amounts of data and typically provide higher precision, better instruction adherence, and more consistent outputs, especially for structured artifacts.
When a local LLM is the only option, it is advisable, together with IT support, to allocate the most powerful machine available. Based on the available hardware, you should then decide:
- which local LLM runtime to use (e.g., Ollama or LM Studio), and
- which specific model best fits your priorities.
At the time of this experiment, Qwen 2.5 7B and Mistral 7B delivered good-quality results, but they required almost twice the execution time compared to faster models such as LLaMA 3.2.
This ultimately becomes a trade-off decision. If speed is the top priority, faster models like LLaMA are a better choice. If output quality is more important, slightly slower models may be justified.
Conclusions
The choice between paid and local LLMs ultimately depends on your specific constraints and priorities. The key is understanding the trade-offs: faster models sacrifice some quality, while higher-quality local models require more processing time.
It is also important to note that models evolve rapidly. Before making a final decision, always review the latest benchmarks and evaluations for the models you are considering. Public benchmark platforms can provide up-to-date insights into quality, speed, and resource consumption, helping you make a more informed choice.
AI changes how software testing is done by looking at business risks, user impact, and past failures to highlight the most important features and workflows. This helps teams focus their testing where it matters most, boosting both efficiency and coverage. By using historical data and business priorities, AI can point out which parts need the most attention. Still, many complex AI models make decisions without clear explanations, which makes debugging and trusting their results hard. That’s why human oversight is crucial to keep the testing process reliable and high-quality.
FAQs
How does AI improve software testing?
AI test automation tools and modern frameworks make it easier to manage and run large test suites. Generative AI can create test data automatically, which helps scale and adapt testing. AI also improves QA by generating test cases, predicting where defects might occur using code and telemetry, creating test data, and supporting self-healing UI testing and visual checks to spot regressions faster. Plus, it offers analytics that point out flaky tests, coverage gaps, and possible root causes, so teams can focus on what matters most.

Can AI do test automation?
Yes, AI can handle many test automation tasks, such as test creation, regression selection, and anomaly detection, which greatly reduces repetitive manual work. But it does have limits: understanding context, subtle business rules, and making safety or fairness decisions still require human review. Plus, all AI-generated results need to be checked for accuracy.

Which AI is best for software testing?
There’s no single “best” AI; the right choice depends on your stack and goals (unit-test generation, functional/UI test automation, visual testing process, or analytics). Common industry choices include Diffblue for automated Java unit tests, Testim, Mabl, and Applitools for self-healing UI and visual checks, and enterprise suites like Tricentis for wide coverage.

What is an LLM in software testing?
An LLM is a large language model based on transformers, trained on vast amounts of text. In software testing, it helps turn requirements into test cases, automatically generate test scripts, summarise logs, sort bug reports, and generate input data from natural language prompts. Since LLMs can sometimes hallucinate, teams treat their suggestions as drafts that need to be checked.

What are software testing tools?
Software testing tools are programs that help teams create, run, and review tests to make sure software works well and is reliable. They cover tasks like automated regression testing, load testing, performance testing, managing tests, tracking defects, and reporting results. Modern AI testing tools can create test scenarios, spot issues, and lower maintenance by adjusting to changes in the user interface or code. Examples include unit-testing frameworks, API testing platforms, visual testing, and continuous testing pipelines.

How can AI be used in QA?
AI can assist QA by generating test cases from requirements, creating synthetic test data, prioritising regression suites based on risk, and identifying patterns behind test failures. Teams often mix AI-generated automation with manual testing to validate edge cases, usability, and business logic. AI testing tools can also watch logs and telemetry to recommend improvements for performance or load testing and point out unstable parts. A good way to use AI is to add it to existing CI/CD workflows and measure results against set quality goals.

What is LLM testing?
LLM testing evaluates large language model behaviour to ensure outputs are accurate, safe, and aligned with expected functionality in specific test scenarios. It includes prompt testing, evaluation against benchmarks or labelled datasets, robustness checks (e.g., adversarial prompts), and monitoring for hallucinations or inconsistent responses. Both automated evaluation pipelines and manual testing are important to assess subjective qualities like helpfulness or clarity. Metrics may include accuracy, consistency, latency (for performance testing), and failure analysis to understand recurring test failures.