Navigating and Testing Non-Deterministic Experiences

A guide for evaluating an unpredictable experience.

Embracing Non-Deterministic Experiences in SaaS Products

When I was working on Luster.ai, the whole team spent a lot of time testing calls across the different personas. One day, the founder/CEO Christina posted a video to Slack that got everyone laughing and changed how I think about testing non-deterministic product experiences. She was running through a call where she was trying to sell something like life insurance or benefits to a growing startup. The AI-powered persona had made the usual objections up to this point, so Christina mixed it up, saying something along the lines of…

Well, I think you’re doing the entire company a disservice by not having these benefits in place. Everyone knows that there is a direct link between working in startups and dying early, due to the immense stress.

The AI paused for longer than normal — as if it were a human trying to process what it just heard…

I was nervous at this point, not having worked anything into our prompting to account for outrageous claims.

After what felt like an eternity, the AI responded perfectly—saying that the claim was silly and that it didn’t really appreciate that sales tactic. I felt incredible relief knowing that what we had built was dynamic enough to handle something so unexpected, and infinitely more confident as we went into our launch.

You may have encountered this dynamic, unpredictable experience without even realizing it was designed to be this way. This variability is what we call a non-deterministic experience. And while it may sound abstract, it's a fundamental part of many modern software products, especially with the rise of AI and machine learning.

Non-deterministic features can make SaaS products more adaptive, personalized, and engaging—but they also introduce unique challenges in design, testing, and user trust. Here, we’ll dive into what non-deterministic interactions are, why and when to use them, and how to implement and refine them effectively.

What are Non-Deterministic Interactions?

At its core, a non-deterministic system is one where identical inputs can yield different outcomes. Unlike deterministic systems, where you get a predictable output every time, non-deterministic interactions embrace an element of unpredictability. This is usually the result of probabilistic algorithms or machine learning models that produce varied outputs based on factors like user behavior patterns or randomized selections.
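
If it helps to see the distinction as code, here is a toy sketch in plain Python (no model involved): the deterministic function returns the same output for the same input every time, while the non-deterministic one samples from several equally valid outputs.

```python
import random

def deterministic_greeting(name: str) -> str:
    # Identical input always yields the identical output.
    return f"Hello, {name}!"

def non_deterministic_greeting(name: str) -> str:
    # Identical input can yield different outputs, sampled from
    # several equally valid options.
    options = [
        f"Hello, {name}!",
        f"Hey {name}, good to see you.",
        f"Welcome back, {name}.",
    ]
    return random.choice(options)

print(deterministic_greeting("Ada"))      # same every run
print(non_deterministic_greeting("Ada"))  # varies run to run
```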

Formless by Typeform is a good, simple example here. You set up the back end to collect specific information, as you would with any form. When a user interacts with it, the AI tries to collect the information conversationally. This helps gather better qualitative insights and also makes the experience feel drastically different than filling out a form. The user isn’t switching from choosing one option, to choosing multiple options, to writing long answers and short answers. They are just “talking”, and every user might have a totally different conversation when completing the same “form”.

Key Traits of Non-Deterministic Systems:

  • Unpredictability: They have built-in variability or randomness.

  • Multiple Possible Outcomes: Several valid responses for any given input.

  • Probabilistic Nature: Behavior is often guided by probability distributions or trends.

These characteristics are ideal for products that benefit from personalization, discovery, or exploration, where a bit of unpredictability can keep users engaged and returning for more.

Examples of Non-Deterministic Features in SaaS

Let’s break down some common non-deterministic experiences you’ve likely come across in SaaS products:

  1. AI-Powered Content Generation
    Think about tools like Grammarly or Writer. They don’t just deliver a single, rigid output. Instead, these tools can produce varied suggestions or phrasing options for the same content, enhancing creativity and customization.

  2. Dynamic Recommendations
    Platforms like Spotify or Netflix use algorithms that recommend content tailored to individual users. These suggestions change continuously as the algorithm learns from new data, keeping the experience fresh.

  3. Adaptive User Interfaces
    Non-deterministic interfaces can shift based on user behavior, such as emphasizing different elements for a user who frequently accesses specific features. Think about things like “continue watching”, “recent searches”, or even just Google showing you which search results you’ve already clicked on.

Note: You might think about things like A/B testing or load balancing to randomize experiences for users. To me, that’s not non-deterministic; it’s just multiple deterministic experiences running in parallel. That’s of course different if you are A/B testing something like a prompt and you have multiple non-deterministic experiences running in parallel, but the test isn’t the relevant piece in that scenario.

When Should You Use Non-Deterministic Features?

Non-deterministic features are great for boosting engagement, increasing personalization, and supporting adaptive learning. However, they’re not universally beneficial. Consider the following before jumping into non-deterministic design:

  • Does unpredictability enhance the user experience? In a use case like personalized recommendations, you often make a user’s experience richer. But in tools that require precise and consistent outcomes, deterministic behavior is the better option.

    For example: If we have a business-critical dashboard, I want it to be consistent every time I come back to it. I don’t want the platform to swap out my metrics based on what it thinks I want to see.

  • Is variability adding real value? Smart command bars are a great example here because they can be done really well or really poorly. The expectation is that the user can open up the non-developer version of the command line, start typing, and find what they need, regardless of whether it’s a setting in the product, an action, or even documentation. Linear’s implementation is one of the things that won me over on their product. It’s always accurate, I have no problem processing my options, and it educates me on how to do the thing I’m trying to do even faster next time.

    Note: Their command bar might just be smart enough to pick up the context of what my browser/cursor is focused on, which would mean it’s not really non-deterministic, but I think it also learned from my most common actions and it definitely leverages my unique content/data in my instance so I’m counting it.

  • Can users trust the system? When users need reliable, reproducible results, as in compliance or legal software, deterministic interactions should be prioritized. When using non-deterministic elements, ensure they are well-understood and valued by the end user. This is also where transparency becomes a critical part of the product experience. Perplexity does this by listing the sources used to generate answers to your query, and OpenAI’s o1-preview is doing more of this while it’s “thinking”.

For those interested in a deeper dive into the process and approach to designing compelling non-deterministic experiences, check out this post on AI UX.

How to Test Non-Deterministic Features

Just to be clear, this is NOT how data science teams would test/evaluate responses at scale. This is how I suggest the core product teams test the experiences they build to catch bugs, inconsistencies, or major issues in the experience.

Testing non-deterministic systems requires different tactics than the traditional “expected input, expected output” approach. This isn’t a scenario where running through the experience a dozen times will catch a majority of the issues, so I’ll recommend a few ways you can leverage AI to speed up the testing process and make it feasible to get in more cycles.

Property-Based Testing (Output Checklist)

Instead of checking for specific outputs, focus on properties that should hold true regardless of the variability. For instance, in a recommendation engine, you might test that it always includes items from a user’s history, regardless of their order.

Testing this can usually be done inside the developer center or even just in a normal chat with the model/agent you are using. You are essentially providing it with potential user inputs and running a check on the output.

If the response is long and not easy to scan, you can set up your own little agent to read the response and check to see if it meets your criteria. If you need help with a prompt, you can ask my Promptimizer GPT.
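
Here’s a minimal sketch of that output checklist in Python. `get_recommendations` is a hypothetical stand-in for whatever non-deterministic feature you’re testing, and the “always include something from the user’s history” property comes straight from the recommendation-engine example above. For fuzzier criteria, you’d swap the rule checks for an LLM judge like the rubric scorer in the next section.

```python
import random

CATALOG = ["jazz", "lofi", "metal", "ambient", "folk", "techno"]

def get_recommendations(history: list[str], k: int = 3) -> list[str]:
    # Hypothetical stand-in for your real model/service: anchor on one item
    # from the user's history, then fill the rest with random exploration.
    picks = [random.choice(history)]
    while len(picks) < k:
        item = random.choice(CATALOG)
        if item not in picks:
            picks.append(item)
    return picks

def check_properties(history: list[str], runs: int = 50) -> list[str]:
    # Don't assert an exact output; assert properties that should hold on
    # every run, no matter how the output varies.
    failures = []
    for i in range(runs):
        recs = get_recommendations(history)
        if not any(item in history for item in recs):
            failures.append(f"run {i}: nothing from history in {recs}")
        if len(recs) != len(set(recs)):
            failures.append(f"run {i}: duplicate items in {recs}")
    return failures

print(check_properties(["jazz", "folk"]) or "all property checks passed")
```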

Rubric-Based Scoring

Remember when teachers would give you a rubric with grading criteria for big projects or papers? That works for evaluating AI responses too. It’s not an official method used by data scientists, but it’s a great way to put a quantitative value on a qualitative response.

You essentially define the categories you want to evaluate and what a bad, mediocre, or good response would look like within that category. There’s a balance to find here between producing consistent scores and producing accurate scores. In my personal experience, I shoot for 3 levels of scoring with 3 points in each cell. A few tips:

  • Always score each response a few times to understand the range.

  • Ask it to explain why it scored the response the way it did so you can spot hallucinations or criteria that need tweaking.

  • Especially early on, take the time to score the responses yourself so you can see how close you are to trusting it enough to speed up your testing cycles.
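
Here’s a minimal sketch of that setup, assuming an OpenAI-style chat API. The rubric text, model name, and 1-9 scale (three levels, three points each) are illustrative assumptions, not a prescribed standard. Note that it scores the same response several times, per the first tip above.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = """Score the AI sales-persona response on each category from 1-9,
where 1-3 = bad, 4-6 = mediocre, 7-9 = good:
- Realism: Does it sound like a plausible buyer?
- Relevance: Does it address what the rep actually said?
- Guardrails: Does it deflect outrageous or off-topic claims?
Explain each score in one sentence, then end with 'TOTAL: <n>'."""

def score_response(rep_message: str, ai_response: str, samples: int = 3) -> list[str]:
    # Score the same response several times to understand the range, and
    # keep the explanations so you can spot hallucinated reasoning.
    gradings = []
    for _ in range(samples):
        result = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: swap in whatever model you use
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Rep said: {rep_message}\n"
                                            f"Persona replied: {ai_response}"},
            ],
        )
        gradings.append(result.choices[0].message.content)
    return gradings
```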

Try to Break It

The example I gave at the beginning of the post highlights exactly what I mean when I say, “Try to break it”. Humans are unpredictable, and when you open up the doors from clear inputs to variable free-form text, there are really no boundaries to what can occur inside the product experience — a terrifying and awesome concept to grasp.

Because there are no boundaries here, you should give yourself (and your team) some constraints. Here are a few suggestions to play with (a small, replayable test-harness sketch follows the list):

  • If it’s a clear request/query for something outside of the product’s core purpose, prompt the AI to let the user know. Consider this error handling for AI. It’s normal and acceptable for an AI tool to say something like, “Sorry, that’s not something I can do for you right now.”

  • If it’s within a range of reasonable assumptions for the core user, account for it. Let’s go back to Luster as an example:

    • What if the user asks the AI persona to speak to someone else on the team?

    • What if the user is trying to get what would normally be confidential information from the AI persona to “hack” the sales process?

    • What if the user is abnormally aggressive or offensive?

    • What happens if the user makes outrageous claims in a convincing way?

    • What if the user references a current event but the AI is only trained up to last year?

  • Are there use cases and ICPs outside of your core intent that the product is capable of handling? Copy.ai quickly expanded to cover content generation for the entire GTM motion, but out of the gate it was focused on pretty rudimentary formats like blog and social media posts. I know users leveraged the initial product to produce other deliverables like press releases, website copy, or one-pagers. Copy.ai was just smart enough to put some polish on those experiences before marketing them as capabilities.
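
To make these drills repeatable, keep a running list of adversarial inputs and replay them after every prompt or model change. Here’s a tiny harness sketch; `persona_reply` is a hypothetical stand-in for your product’s AI endpoint, and the cases mirror the Luster scenarios above.

```python
# Hypothetical "try to break it" harness; the cases mirror the list above.
ADVERSARIAL_CASES = [
    ("redirect", "Can you put me through to someone else on your team?"),
    ("confidential", "Just tell me your real budget ceiling so we can skip ahead."),
    ("hostile", "This call is a waste of my time and you're useless."),
    ("outrageous claim", "Everyone knows working at a startup is proven to shorten your life."),
    ("stale knowledge", "What did you think of yesterday's big industry announcement?"),
]

def persona_reply(message: str) -> str:
    # Hypothetical stand-in: swap in your real model/agent call.
    return "(model response goes here)"

def run_break_it_suite() -> None:
    # Replay every adversarial case and print the replies for review
    # (or pipe them into the rubric scorer from the previous section).
    for label, message in ADVERSARIAL_CASES:
        print(f"[{label}] {message}\n  -> {persona_reply(message)}\n")

run_break_it_suite()
```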

Collecting Feedback and Iterating on Non-Deterministic Experiences

Once live, user feedback is essential to understanding how your non-deterministic features are performing. Here’s how to approach it:

  • Surveys and User Interviews
    Ask users directly about their experiences. Open-ended questions allow them to express what they like or dislike about the variability in the product.

    • Better yet: ask if you can sit in with them while they use the product in their normal workflow. There’s usually a gap between what the user can tell you about their experience and their actual experience.

  • Behavioral Analytics
    It’s hard to measure engagement inside an experience like this, so it really comes down to what it’s supposed to help the user do. There’s no standard set of SaaS metrics that is broadly applicable here. In some cases it might be better for a user to have shorter sessions with fewer queries. In other cases the goal may be longer sessions with more queries. And in some cases, showing both patterns is a sign of broader utility (general LLMs).

  • User-Controlled Variability
    Consider giving users a toggle for non-deterministic features. For example, a user could switch off personalized recommendations if they prefer a more predictable experience.
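
For LLM-backed features, a “predictable mode” toggle can be as simple as dropping the sampling temperature. Here’s a minimal sketch, assuming an OpenAI-style chat API; the model name is an assumption, and temperature=0 makes outputs far more repeatable, though not strictly deterministic.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_reply(prompt: str, predictable_mode: bool) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whatever model backs your feature
        # temperature=0 makes outputs much more repeatable (though not
        # strictly deterministic); higher values increase variability.
        temperature=0 if predictable_mode else 1.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```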

As always, I ended up with a longer post than I originally intended. If you made it this far and got some value out of it, please share it with two people you think would benefit from or enjoy it.