
Hard Problems

🏷️ blog 🏷️ pairwise

When building a tool for which all the pieces seem readily available, yet which nobody has actually executed upon, you begin to ask yourself why that is. This is essentially what I've been going through building the pairwise tool.

Every time I look around and don't see a solution for an old problem on CPAN, my spider-senses start to fire.  I saw no N-dimensional combination methods (only n choose k) and no bin covering algorithms, and when you see a lack of N-dimensional solutions, that usually means there is a lack of closed-form general solutions to the problem.  While this is not true for my problem space, it rubs right up against the edge of NP-hard problems.  So it's not exactly shocking I didn't find anything fit to purpose.

The idea behind pairwise test execution is actually quite simple, but the constraints of the software systems surrounding it risk making it more complex than is manageable. Unless we confine ourselves to a very specific set of constraints, we run into not one, but two NP-hard problems. We could then be forced into the unfortunate situation of having to use polynomial-time approximations.

I've run into this a few times in my career. Each time, the team grows disheartened, as what the customer wants seems on the surface to be impossible. I remember that there is always a way to win by cheating (tighter constraints). Even the tyranny of the rocket equation was overcome through these means (let's put a little rocket on a big one!)

Breaking it down

The first problem is that N-wise test choosing is simply a combination. This results in far, far more platforms to test than is practical once you get beyond 3 independent variables relevant to your system under test. For example:

A combination of 3 sets containing 3, 5 and 8 elements will result in 3 * 5 * 8 = 120 systems under test! Adding in a fourth or fifth set will quickly bring you into the territory of thousands of systems to test.  While this is straightforward to accomplish these days, it is quite expensive.
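
You can eyeball how fast this blows up with a throwaway check (not part of the tool):

use List::Util qw{product};

print product(3, 5, 8), "\n";        # 120 systems under test
print product(3, 5, 8, 4, 6), "\n";  # 2880 once a fourth and fifth variable show up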

What we actually want is an expression of the pigeonhole principle.  We wish to build sets where every element of each component set is seen at least once, as this will cover everything with the minimum number of needed systems under test.  This preserves the practical purpose of pairwise testing quite nicely.

In summary, we have a clique problem and a bin covering problem. This means that we have to build a number of bins from several sets, each containing some number of members. We then have to fill said bins with a bunch of tests in a way which will result in them being executed as fast as possible.

Each bin we build will represent some system under test, and each set from which we build these bins represents a particular important attribute. For example, consider these sets:

  • Operating Systems: Windows, Linux, OSX
  • Processor Architecture: 32-bit, 64-bit
  • Browser: Firefox, Chrome, Safari, Brave, Opera, SeaMonkey

A random selection will result in an optimal multi-dimensional "pairwise" set of systems under test:

  1. Firefox - Windows - 64-bit
  2. Chrome - Linux - 32-bit
  3. Safari - Windows - 32-bit
  4. Brave - OSX - 32-bit
  5. Opera - OSX - 64-bit
  6. SeaMonkey - Linux - 64-bit

The idea is to walk through each element of the set with the most members, and to pick from every remaining set at the index of the current pick from the big set, modulo the smaller set's size. This is the "weak" form of the pigeonhole principle in action, which is why it is solved easily with the Chinese remainder theorem.
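
A rough sketch of that selection rule, with the sets above hard-coded for clarity (the real code below generalizes it):

my @browsers = qw{Firefox Chrome Safari Brave Opera SeaMonkey};  # the largest set
my @oses     = qw{Windows Linux OSX};
my @arches   = qw{64-bit 32-bit};

# Walk the largest set once; index the smaller sets modulo their size.
foreach my $i (0 .. $#browsers) {
    print join(' - ', $browsers[$i], $oses[$i % @oses], $arches[$i % @arches]), "\n";
}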

Sometimes you can oversimplify

You may have noticed that perhaps we are going too far with our constraints here. Therein lies danger, as the "strong" general form of the pigeonhole principle means we are treading into the waters of Ramsey's (clique) problem. For example, consider what happens if we drop either of these two assumptions we can derive from our sets:

  1. No element of any given set is repeated
  2. No element of any given set is shared with another

We immediately descend into the realm of the NP-hard. This is because we are no longer working in a principal ideal domain, and can no longer cheat using the Chinese remainder theorem. In that reality we are solving the anti-clique problem specifically, which is particularly nasty. Thankfully, we can consider those two constraints to be quite realistic.

We will also have to account for the fact that the variables are actually not independent. You may have noticed that some of these "optimal" configurations are not realistic. Many operating systems do not support various processor architectures and software packages. Three of the configurations above are currently invalid for at least one reason.  Consider a configuration object like so:

my $conf = {
    PlatformGroups => {
        'Operating Systems' => [qw{CentOS Ubuntu Windows OSX}],
        'CPU Architecture'  => [qw{32-bit 64-bit}],
        'Browser'           => [qw{Firefox Opera Safari Chrome Iexplore Brave Dillo lynx}],
        'Mail Server'       => [qw{exim courier postfix qmail exchange}],
        'HTTP Server'       => [qw{nginx apache lighttpd thttpd}],
        'Database Server'   => [qw{postgres mysql mariadb mssql oracle}],
        'Message Queue'     => [qw{rabbitmq zmq}],
        'Search Engine'     => [qw{solr lunr elasticsearch}],
    },
    incompatibilities => {
        'Windows' => [qw{32-bit Safari Dillo qmail exim courier postfix thttpd solr}],
        'OSX'     => [qw{32-bit Iexplore}],
        'CentOS'  => [qw{Iexplore}],
        'Ubuntu'  => [qw{Iexplore}],
    },
};
Thanks to the requirement that no element is repeated or shared between sets, we can use a simplified data structure here rather than over-complicating the PlatformGroups data structure (and our processing code).

Can we throw away these configurations without simply "re-rolling" the dice?  Unfortunately, no.  Not without using the god algorithm of computing every possible combination ahead of time, and therefore already knowing the answer.  As such our final implementation looks like so:

use feature 'signatures';
no warnings 'experimental::signatures';
use List::Util qw{shuffle};

sub cliques($conf,$tests) {
    my %pgroups = ref $conf->{PlatformGroups} eq 'HASH' ? %{$conf->{PlatformGroups}} : ();
    my @plans;

    # Randomize the ordering of the platform groups for eventual consistency.
    foreach my $pg (keys(%pgroups)) {
        @{$pgroups{$pg}} = shuffle(@{$pgroups{$pg}});
    }

    # The idea here is to have at least one pigeon in each hole.
    # This is accomplished by finding the longest group, then iterating over everything we have modulo each group's size.
    my $longest = (sort { scalar(@{$pgroups{$b}}) <=> scalar(@{$pgroups{$a}}) } keys(%pgroups))[0];
    my $llen = scalar(@{$pgroups{$longest}});
    my $tot = scalar(@$tests);

    # Bin covering (these only come into play when we fill the bins below)
    my $remainder = $tot % $llen;
    my $to_take = int($tot / $llen);
    my $offset = 0;

    for (my $i=0; $i < $llen; $i++) {
        my @newplats;
        foreach my $pgroup ( sort { scalar(@{$pgroups{$b}}) <=> scalar(@{$pgroups{$a}}) } keys(%pgroups)) {
            my $idx = $i % scalar(@{$pgroups{$pgroup}});
            my $orig_idx = $idx;

            # If a partial is invalid, we must re-roll the dice.
            while (!combination_valid($conf, @newplats, $pgroups{$pgroup}[$idx])) {
                $idx = ($idx + 1) % scalar(@{$pgroups{$pgroup}});
                # Allow for 'incomplete' sets omitting a configuration group entirely due to total incompatibility
                last if $idx == $orig_idx;
            }
            push(@newplats, $pgroups{$pgroup}[$idx]);
        }
        push(@plans, \@newplats);
    }
    return \@plans;
}

sub combination_valid ($conf,@combo) {
    my %compat = %{$conf->{incompatibilities}};
    foreach my $key (keys(%compat)) {
        next unless ref $compat{$key} eq 'ARRAY';
        # Count combo members which are either this key or one of its conflicts.
        my @compat = grep { my $element = $_; defined $element && ( $element eq $key || grep { $element eq $_ } @{$compat{$key}} ) } @combo;
        return 0 if @compat > 1;
    }
    return 1;
}
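
For illustration, calling all of this with a handful of dummy test names might look like so (the names are made up, and the output will vary with the shuffle):

my @tests = map { "t/$_.t" } 1 .. 20;
my $plans = cliques($conf, \@tests);
# Each plan is one system under test, e.g.:
# [ 'Ubuntu', '64-bit', 'Firefox', 'postfix', 'nginx', 'mariadb', 'rabbitmq', 'solr' ]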

This brings us to another unmentioned constraint: what happens if a member of a set is incompatible with all members of another set?  It turns out accepting this is actually a significant optimization, as we will end up never having to re-roll an entire sequence.  See the while loop above.

Another complication is the fact that we will have to randomize the set order to achieve the goal of eventual coverage of every possible combination. Given the intention is for the tool to run decentralized, with no central oracle other than git, we'll also have to use a seed based upon git's current state.  The algorithm above does not implement this, but it should be straightforward to add.
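
One way to get such a seed (a sketch only; it assumes git is on the PATH) is to derive it from the current HEAD, so that every host which has fetched the same state shuffles identically:

# Hypothetical seeding scheme: every host at the same HEAD gets the same shuffle.
chomp( my $head = qx{git rev-parse HEAD} );
srand( hex( substr($head, 0, 8) ) );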

Filling the bins

We at least have a solution to the problem of building the bins, so we can move on to filling them. Here we encounter some quite severe trade-offs. If we wish to accurately reflect reality with our assumptions, we immediately stray into "no closed form solution" territory. This is the Fair Item Allocation problem, but with a significant twist: to take better advantage of our available resources, we should always execute at least one test per bin. This results in fewer iterations to run through every possible combination of systems to test, but also means we've cheated by adding a "double spend" on the low end.  Hooray cheating!

The fastest approximation is essentially to dole out floor( tests / bins ) tests to each bin, plus floor( (tests % bins) / tests ) in the case where you have fewer tests than bins. This has an error which is not significant until you reach millions of tests, at which point we get eaten alive by rounding error due to flooring.

We could simply dump the remainder into one bin and give up on fair allocation.  But given the remainder will always be lower than the number of bins, we can just shave one off of it each go-through until we run out (while still retaining the minimum bound of 1).  This is the optimal solution:

my $choose    = int( $total_tests / $bins );
my $remainder = $total_tests % $bins;
...
# later in our loop
my $take = $choose + ( $remainder && 1 ) || 1;
$remainder-- if $remainder;
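
For example, doling out 10 tests over 4 bins hands the bins 3, 3, 2 and 2 tests respectively (a quick demonstration of the loop above):

my ($total_tests, $bins) = (10, 4);
my $choose    = int( $total_tests / $bins );   # 2
my $remainder = $total_tests % $bins;          # 2
foreach my $bin (1 .. $bins) {
    my $take = $choose + ( $remainder && 1 ) || 1;
    print "bin $bin takes $take\n";            # 3, 3, 2, 2
    $remainder-- if $remainder;
}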

From there we simply splice out the relevant elements from the array of tests.  The completed algorithm has some minor differences from cliques() above:

sub cliques($conf,$tests) {
    my %pgroups = ref $conf->{PlatformGroups} eq 'HASH' ? %{$conf->{PlatformGroups}} : ();
    my @plans;

    # Randomize the ordering of the platform groups for eventual consistency.
    foreach my $pg (keys(%pgroups)) {
        @{$pgroups{$pg}} = shuffle(@{$pgroups{$pg}});
    }

    # The idea here is to have at least one pigeon in each hole.
    # This is accomplished by finding the longest group, then iterating over everything we have modulo each group's size.
    my $longest = (sort { scalar(@{$pgroups{$b}}) <=> scalar(@{$pgroups{$a}}) } keys(%pgroups))[0];
    my $llen = scalar(@{$pgroups{$longest}});
    my $tot = scalar(@$tests);

    # Bin covering
    my $remainder = $tot % $llen;
    my $to_take = int($tot / $llen);
    my $offset = 0;

    for (my $i=0; $i < $llen; $i++) {
        my @newplats;
        foreach my $pgroup ( sort { scalar(@{$pgroups{$b}}) <=> scalar(@{$pgroups{$a}}) } keys(%pgroups)) {
            my $idx = $i % scalar(@{$pgroups{$pgroup}});
            my $orig_idx = $idx;

            # If a partial is invalid, we must re-roll the dice.
            while (!combination_valid($conf, @newplats, $pgroups{$pgroup}[$idx])) {
                $idx = ($idx + 1) % scalar(@{$pgroups{$pgroup}});
                # Allow for 'incomplete' sets omitting a configuration group entirely due to total incompatibility
                last if $idx == $orig_idx;
            }
            push(@newplats, $pgroups{$pgroup}[$idx]);
        }

        my $tt = $to_take + ( $remainder && 1 ) || 1;
        push(@plans, { tests => [splice(@$tests, $offset, $tt)], platforms => \@newplats });
        $remainder-- if $remainder;
        $offset += $tt;

        # Just repeat tests in the event we have more SUTs available than tests
        $offset = $offset % $tot;
    }
    return \@plans;
}

It is worth noting there is yet another minor optimization in our production process at the end here: if we have more systems available for testing than tests to execute, we can achieve total coverage in fewer iterations by repeating tests from earlier groups.

Trade-offs in my trade-offs

Even this makes some significant assumptions:
  1. Each item we are packing into a bin is of equal size. This means every test is assumed to run in the same amount of time on the same computer.
  2. Each item is indivisible
  3. Each bin values each item equally (in our context this means "every computer executes it in the same amount of time")
  4. A test's execution time never changes, even when the test itself or the system under test does.
  5. Each bin represents one computer only.

Obviously the only realistic assumption here is #2. If tests can be executed faster by breaking them into smaller tests, the test authors should do so, not an argument builder.

Assumptions #1 and #3, if we take them seriously, would not only doom us to solving an NP-hard problem, but saddle us with a host of other practical issues. Knowing how long each test takes on each computer is quite a large sampling problem, though solvable eventually, even using only git tags to store the data. Even then, #4 makes this an exercise in futility. We really have no choice but to accept this source of inefficiency in our production process.

Invalidating #5 does not bring us too much trouble. Since we expect to have a number of test hosts which satisfy any given configuration from the optimal group, and will know how many there are ahead of time, we can simply split the bin over the available hosts by re-running our bin packer over those hosts.
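
A minimal sketch of that splitting step (split_bin is a hypothetical helper, not part of the tool), reusing the remainder trick from earlier:

sub split_bin ($bin, $n_hosts) {
    my @tests     = @{ $bin->{tests} };
    my $choose    = int( @tests / $n_hosts );
    my $remainder = @tests % $n_hosts;
    my @shares;
    foreach my $host (1 .. $n_hosts) {
        my $take = $choose + ( $remainder && 1 );
        # Hosts beyond the number of tests simply get an empty share.
        push(@shares, { platforms => $bin->{platforms}, tests => [ splice(@tests, 0, $take) ] });
        $remainder-- if $remainder;
    }
    return \@shares;
}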

This will inevitably result in a situation where you have an overabundance of available systems under test for some configurations and a shortage of others. Given enough tests, this can result in workflow disruptions. This is a hard problem to solve without "throwing money at the problem", or being more judicious with what configurations you support in the first place. That is the sort of problem an organization wants to have though. It is preferable to the problem of wasting money testing everything on every configuration.

Whither N-wise

Since the name of the tool is pairwise, I may as well also implement and discuss multi-set combinations.  Building these bins is actually quite straightforward, which is somewhat shocking given that every algorithm featured for doing pairwise testing at pairwise.org was not in fact the optimal one from my 30-year-old combinatorics textbook.  Pretty much all of them used tail-call recursion in languages which do not optimize it, or took (good) shortcuts which prevented them from functioning in N dimensions.

Essentially, you build an iterator which, starting with the first set, pushes onto your stack a partial combination for every element of that set matched with one of the second. You then repeat the process, treating each partial as the first set, and crank right through all the remaining sets.

Dealing with incompatibilities is essentially the same procedure as above.  The completed algorithm looks like so:

use List::Util qw{product};

sub combine($conf,$tests) {
    my %pgroups = ref $conf->{PlatformGroups} eq 'HASH' ? %{$conf->{PlatformGroups}} : ();
    my @plans;

    # Construct iterator
    my @pigeonholes = values(%pgroups);
    my $bins = product map { scalar(@$_) } @pigeonholes;
    my $tot_tests = scalar(@$tests);

    # Bin covering
    my $remainder = $tot_tests % $bins;
    my $to_take = int($tot_tests / $bins);

    my $offset = 0;

    my @iterator = @{$pigeonholes[0]};
    while (scalar(@iterator)) {
        my $subj = shift @iterator;

        # Handle initial elements
        $subj = [$subj] if ref $subj ne 'ARRAY';

        # Emit a plan and move on once we have no more possibilities to exploit
        if (scalar(@$subj) == scalar(@pigeonholes)) {
            my $tt = $to_take + ( $remainder && 1 ) || 1;
            push(@plans, { tests => [ $offset, $tt ], platforms => $subj });
            $remainder-- if $remainder;
            $offset += $tt;
            # Just repeat tests in the event we have more SUTs than tests
            $offset = $offset % $tot_tests;
            next;
        }

        # Keep pushing partials onto the end of the iterator, until we run out of categories to add
        foreach my $element (@{$pigeonholes[scalar(@$subj)]}) {
            my @partial = @$subj;
            # If the combination isn't valid, push an undef member to simplify loop breakout.
            # This results in some configurations which are essentially the same.
            # That said, we cannot simply discard them if we wish to cover the case of a configuration having incompatibilities with entire configuration groups.
            # We could compress them later to avoid some slop, but it's probably not worth the effort.
            push(@partial, combination_valid($conf, @partial, $element) ? $element : undef);
            push(@iterator, \@partial);
        }
    }
    return \@plans;
}
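
As a sanity check on the sizes involved: combine() walks the full Cartesian product, so the number of bins is the product of the group sizes. For the example $conf above:

my $plans = combine($conf, \@tests);
print scalar(@$plans), "\n";   # 4 * 2 * 8 * 5 * 4 * 5 * 2 * 3 = 38400 platform combinations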

Uniting all under Heaven

You may have noticed this is a greedy algorithm.  If we decided to use it to generate a cache for a "god algorithm" version of the anti-clique generator above, we could very easily run into memory exhaustion with large enough configuration sets, defeating the purpose. You could flush the partials that are actually complete, but even then you'd only be down to 1/n theoretical memory usage, where n is the size of your 2nd-largest configuration set (supposing you sort such that it's encountered last).  This may prove "good enough" in practice, especially since users tend to tolerate delays in the "node added to network" phase better than in the "trying to run tests" phase.  It would also speed up the matching of available systems under test to the desired configuration supersets, as there too we could "already know the answer".

Profiling this showed that I either had to fix my algorithm or resort to caching.  My "worst case" example of 100 million tests using the cliques() method took 3s, while generating everything took 4s.  Profiling showed the inefficient parts were almost 100% my bin covering.

Almost all of this time is spent splice()ing huge arrays of tests.  In fact, the vast majority of the time in my test (20s total!) is simply building the sequence (1..100_000_000), which we are using as a substitute for a similarly long array of test arguments.

We are in luck, as once again we have an optimization suggested by the constraints of our execution environment.  Given any host only needs to know what it needs to execute, we can save only the relevant indices and do lazy evaluation.  This means our sequence expansion (which takes the most time) is bounded by how long it takes to generate up to our offset.  The change is straightforward:

push(@plans, { tests => [ $offset, $tt ], platforms => \@newplats });

The question is: can we cheat even more by starting at our offset too?  Given we are expecting a glob or regex describing a number of files, and we don't know ahead of time what will be produced, this seems unlikely.  We could probably speed up the globbing with GLOB_NOSORT.  Practically every other sieve trick we could try (see De Morgan's laws) is already part of the C library implementing glob itself.  I suspect we would have to understand the parity problem a great deal better to seek optimally via search criteria.
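
On the consuming side, a host holding a [ $offset, $tt ] pair might expand only its own slice, something like this (a sketch; the t/*.t pattern is an assumption):

use File::Glob qw{bsd_glob GLOB_NOSORT};

my ($offset, $tt) = @{ $plan->{tests} };
# Skip glob's sort per the GLOB_NOSORT idea above; the order must still be stable across hosts for the indices to agree.
my @all  = bsd_glob('t/*.t', GLOB_NOSORT);
my @mine = @all[ map { $_ % @all } $offset .. $offset + $tt - 1 ];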

Nevertheless, this gets our execution time for the cliques() algorithm down to 10ms, and a 3s upper bound to generate our sequence isn't bad compared to how long it will take to execute our subset of 100 million tests.  We'd probably slow the program down using a cached solution at this point, not to mention having to deal with the problems inherent in such a cache.  Generating all the combinations needed to build the cache itself takes another 3s, and there's no reason to punish most users just to handle truly extreme data sets.

It is possible we could optimize our check that a combination is valid, and get a more reasonable execution time for combine() as well.  Here's our routine as a refresher:

sub combination_valid ($conf,@combo) {
    my %compat = %{$conf->{incompatibilities}};
    foreach my $key (keys(%compat)) {
        next unless ref $compat{$key} eq 'ARRAY';
        # Count combo members which are either this key or one of its conflicts.
        my @compat = grep { my $element = $_; defined $element && ( $element eq $key || grep { $element eq $_ } @{$compat{$key}} ) } @combo;
        return 0 if @compat > 1;
    }
    return 1;
}

Making the inner grep a List::Util::first instead seems obvious, but the added overhead made it not worth it for the small data set. Removing our guard, on the other hand, halved execution time, so I have removed it in production.  Who knew ref() was so slow?  Next, I "disengaged safety protocols" by turning off warnings and killing the defined check.  This made no appreciable difference, so I still haven't run into a situation where I've needed to turn off warnings in a tight loop.  Removing the unnecessary allocation of @compat and returning directly shaved another 200ms.  All told, I got down to 800ms, which is "detectable but barely" delay territory, and good enough in my book.
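
After those changes, the routine looks roughly like this (my reconstruction of the production version described above, not a verbatim copy; the defined check stays so warnings remain quiet):

sub combination_valid ($conf,@combo) {
    my %compat = %{ $conf->{incompatibilities} };
    foreach my $key (keys(%compat)) {
        # The ref() guard and the @compat temporary are gone; bail as soon as we see a conflict.
        return 0 if 1 < grep {
            my $element = $_;
            defined $element && ( $element eq $key || grep { $element eq $_ } @{ $compat{$key} } )
        } @combo;
    }
    return 1;
}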

Conclusion

The thing I take away from all this is that the most useful thing a mathematics education teaches is the ability to identify specific problems as instances of generalized problems (to which a great deal of thinking has already been devoted).  While this is not a new lesson, I am continually astonished at how unreasonably effective it is.  That, and exposure to the wide variety of pursuits in mathematics gives you a leg up as to where to start looking.

I also think the model I took developing this has real strength.  Developing a program while simultaneously doing what amounts to a term paper on how it's to operate very clearly draws out a program's constraints and acceptance criteria a priori.  It also makes documentation a fait accompli.  Making sure to test and profile along the way completed the (as best as is possible without users) methodologically dual design, giving me the utmost confidence that this program will be fit for purpose.  Given most "technical debt" is caused by not fully understanding the problem before writing your program (which is so common it might shock the uninitiated) and by making sub-optimal trade-offs when designing it, I think this approach mitigates most risks in that regard.

That said, it's a lot harder to think things through and then test your hypotheses than to charge in like a bull in a china shop or grope in the dark, which is the most common pattern I see doing software development professionally.  To be fair, it's not like people are actually willing to pay for what it takes to achieve real quality, and "good enough" often is.  Bounded rationality is the rule of the day, and our lot in life is mostly that of a satisficer.  Optimal can be the enemy of good, and the trade-offs we've made here certainly prove this out.

When I was doing QA for a living, people were surprised when I told them that the most important book for testers to read is Administrative Behavior.  This is because you have to understand the constraints of your environment to do your job well, which is to provide actionable information to decision-makers.  I'm beginning to realize this actually suffuses the entire development process from top to bottom.
