When preparing any tool which you see all the pieces readily available, but that nobody has executed upon, you begin to ask yourself why that is. This is essentially what I've been going through building the pairwise tool.
Every time I look around and don't see a solution for an
old problem on CPAN, my spider-senses start to fire. I saw
no N-dimensional combination methods (only n Choose k) or bin
covering algorithms, and when you see a lack of N-dimensional
solutions that usually means there is a lack of closed form
general solutions to that problem. While this is not true
for my problem space, it rubs right up against the edge of NP hard
problems. So it's not exactly shocking I didn't see anything
fit to purpose.
The idea behind pairwise
test execution is actually quite simple, but the constraints
of the software systems surrounding it risk making it more complex
than is manageable. This is because unless we confine ourselves to
a very specific set of constraints, we run into not one, but two
NP hard problems. We could then be forced into the unfortunate
situation where we have to use Polynomial time approximations.
I've run into this a few times in my career. Each time the team
grows disheartened as what the customer wants seems on the surface
to be impossible. I always remember that there is always a way to
win by cheating (more tight constraints). Even the tyranny of the
rocket equation was overcome through these means (let's put a
little rocket on a big one!)
The first problem is that N-Wise test choosing is simply a combination.
This results in far, far more platforms to test than is practical once you get beyond 3 independent variables relevant to your system under test. For example:
A combination with 3 sets containing 3, 5 and 8 will result in 3
* 5 * 8 = 120 systems under test! Adding in a fourth or fifth will
quickly bring you into the territory of thousands of systems to
test. While this is straightforward to accomplish these
days, it is quite expensive.
What we actually want is an expression of the pigeonhole
principle. We wish to build sets where every
element of each component set is seen at least once, as
this will cover everything with the minimum number of needed
systems under test. This preserves the practical purpose of
pairwise testing quite nicely.
In summary, we have a clique problem and a bin covering problem. This means that we have to build a number of bins from X number of sets each containing some amount of members. We then have to fill said bins with a bunch of tests in a way which will result in them being executed as fast as is possible.
Each bin we build will represent some system under test, and each set from which we build these bins a particular important attribute. For example, consider these sets:
A random selection will result in an optimal multi-dimensional "pairwise" set of systems under test:
The idea is to pick one of each of the set with the most members
and then pick from the remaining ones at the index of the current
pick from the big set modulo the smaller set's size. This is the
"weak" form of the Pigeonhole Principle in action, which is why it
is solved easily with the Chinese
You may have noticed that perhaps we are going too far with our constraints here. This brings in danger, as the "strong" general form of the pigeonhole principle means we are treading into the waters of Ramsey's (clique) problem. For example, if we drop either of these two assumptions we can derive from our sets:
We immediately descend into the realm of the NP hard problem.
This is because we are no longer a principal ideal domain and can
no longer cheat using the Chinese remainder theorem. In this
reality, we are solving the Anti-Clique
problem specifically, which is particularly nasty. Thankfully, we
can consider those two constraints to be quite realistic.
We will have to account for the fact that the variables are
actually not independent. You may have noticed that some of these
"optimal" configurations are not actually realistic. Many
Operating systems do not support various processor architectures
and software packages. Three of the configurations above are
currently invalid for at least one reason. Consider a
configuration object like so:
Can we throw away these configurations without simply
"re-rolling" the dice? Unfortunately, no. Not without
using the god
algorithm of computing every possible combination ahead of
time, and therefore already knowing the answer. As such our
final implementation looks like so:
This brings us to another unmentioned constraint: what happens if
a member of a set is incompatible with all members of another
set? It turns out accepting this is actually a significant
optimization, as we will end up never having to re-roll
an entire sequence. See the while loop above.
Another complication is the fact that we will have to randomize
the set order to achieve the goal of eventual coverage of every
possible combination. Given the intention of the tool is to run
decentralized and without a central oracle other than git,
we'll have to also have use a seed based upon it's current
state. The algorithm above does not implement this,
but it should be straightforward to add.
We at least have a solution to the problem of building the bins. So, we can move on to filling them. Here we will encounter trade-offs which are quite severe. If we wish to accurately reflect reality with our assumptions, we immediately stray into "no closed form solution" territory. This is the Fair Item Allocation problem, but with a significant twist. To take advantage of our available resources better, we should always execute at least one test. This will result in fewer iterations to run through every possible combination of systems to test, but also means we've cheated by adding a "double spend" on the low-end. Hooray cheating!
The fastest approximation is essentially to dole out a number of
tests equal to the floor of dividing the tests equally among the
bins plus floor( (tests % bins) / tests ) in
the case you have less tests than bins. This has an error which is
not significant until you reach millions of tests. We then get
eaten alive by rounding error due to flooring.
It is worth noting there is yet another minor optimization in our
production process here at the end, namely that if we have more
systems available for tests than tests to execute, we can achieve
total coverage in less iterations by repeating tests from earlier
Obviously the only realistic assumption here is #2. If tests can be executed faster by breaking them into smaller tests, the test authors should do so, not an argument builder.
Assumptions #1 and #3, if we take them seriously would not only doom us to solving an NP hard problem, but have a host of other practical issues. Knowing how long each test takes on each computer is quite a large sampling problem, though solvable eventually even using only git tags to store this data. Even then, #4 makes this an exercise in futility. We really have no choice but to accept this source of inefficiency in our production process.
Invalidating #5 does not bring us too much trouble. Since we expect to have a number of test hosts which will satisfy any given configuration from the optimal group and will know how many there are ahead of time, we can simply split the bin over the available hosts and re-run our bin packer over those hosts.
This will inevitably result in a situation where you have an
overabundance of available systems under test for some
configurations and a shortage of others. Given enough tests, this
can result in workflow disruptions. This is a hard problem to
solve without "throwing money at the problem", or being more
judicious with what configurations you support in the first place.
That is the sort of problem an organization wants to have though.
It is preferable to the problem of wasting money testing
everything on every configuration.
Since the name of the tool is pairwise, I may as well also
implement and discuss multi-set combinations. Building these
bins is actually quite straightforward, which is somewhat shocking
given every algorithm featured for doing pairwise testing at
pairwise.org was not in fact the optimal one from my 30 year old
combinatorics textbook. Pretty much all of them used
tail-call recursion in languages which do not optimize this, or
they took (good) shortcuts which prevented them from functioning
in N dimensions.
Essentially you build an iterator which, starting with the first
set, pushes a partial combination with every element of its set
matched with one of the second onto your stack.
You then repeat the process, considering the first set to be the partial, and crank right through all the remaining sets.
Dealing with incompatibilities is essentially the same procedure
as above. The completed algorithm looks like so:
You may have noticed this is a greedy algorithm. If we
decided to use this as a way to generate a cache for a "god
algorithm" version of the anti-clique generator above, we could
very easily run into memory exhaustion with large enough
configuration sets, defeating the purpose. You could flush the
partials that are actually complete, but even then you'd only be
down to 1/n theoretical memory usage where n is the size of your
2nd largest configuration set (supposing you sort such that it's
encountered last). This may prove "good enough" in practice,
especially since users tend to tolerate delays in the "node added
to network" phase better than the "trying to run tests"
phase. It would also speed up the matching of available
systems under test to the desired configuration supersets, as we
could also "already know the answer".
Profiling this showed that I either had to fix my algorithm or
resort to this. My "worst case" example of 100 million tests
using the cliques() method took 3s, while generating everything
took 4. Profiling shows the inefficient parts are almost
100% my bin-covering.
Almost all of this time is spent splice()ing huge arrays of
tests. In fact, the vast majority of the time in my test
(20s total!) is simply building the sequence (1..100_000_000),
which we are using as a substitute for a similar length argument
array of tests.
We are in luck, as once again we have an optimization suggested
by the constraints of our execution environment. Given any
host only needs to know what it needs to execute we can
save only the relevant indices, and do lazy
evaluation. This means our sequence expansion (which
takes the most time) has an upper bound of how long it takes to
generate up to our offset. The change is
The question is, can we cheat even more by starting at our offset
too? Given we are expecting a glob or regex describing a
number of files which we don't know ahead of time what will be
produced, this seems unlikely. We could probably speed it up
every other sieve trick we can try (see DeMorgan's
Laws) is already part of the C library implementing glob
itself. I suspect that we will have to understand the parity
problem a great deal better for optimal
seeking via search criteria.
Nevertheless, this gets our execution time for the cliques()
algorithm down to 10ms, and 3s as the upper bound to generate our
sequence isn't bad compared to how long it will take to execute
our subset of 100 million tests. We'd probably slow the
program down using a cached solution at this point, not to mention
having to deal with the problems inherent with such.
Generating all combinations as we'd have to do to build the cache
itself takes another 3s, and there's no reason to punish most
users just to handle truly extreme data sets.
It is possible we could optimize our check that a combination is
valid, and get a more reasonable execution time for combine() as
well. Here's our routine as a refresher:
Making the inner grep a List::Util::first instead seems obvious,
but the added overhead made it not worth it for the small data
set. Removing our guard on the other hand halved execution time,
so I have removed it in production. Who knew ref( ) was so
slow? Next, I "disengaged safety protocols" by turning off
warnings and killing the defined check. This made no
appreciable difference, so I still haven't yet run into a
situation where I've needed to turn off warnings in a tight
loop. Removing the unnecessary allocation of @compat and
returning directly shaved another 200ms. All told, I got
down to 800ms, which is in "detectable but barely" delay
territory, which is good enough in my book.
The thing I take away from all this is that the most useful thing
a mathematics education teaches is the ability to identify
specific problems as instances of generalized problems (to which a
great deal of thinking has already been devoted). While this
is not a new lesson, I continuously astonish myself how
unreasonably effective it is. That, and exposure to the wide
variety of pursuits in mathematics gives a leg up as to where to
I also think the model I took developing this has real
strength. Developing a program while simultaneously doing
what amounts to a term paper on how it's to operate very clearly
draws out the constraints and acceptance criteria from a program
in an apriori way. It also makes documentation a fait
accompli. Making sure to test and profile while doing this
as well completed the (as best as is possible without users) methodologically
dual design, giving me the utmost confidence that this
program will be fit for purpose. Given most "technical debt"
is caused by not fully understanding the problem when going into
writing your program (which is so common it might shock the
uninitiated) and making sub-optimal trade-offs when designing it,
I think this approach mitigates most risks in that regard.
That said, it's a lot harder to think things through and then
test your hypotheses than just charging in like a bull in a china
shop or groping in the dark. This is the most common pattern
I see in practice doing software development professionally.
To be fair, it's not like people are actually willing to pay
for what it takes to achieve real quality, and "good enough" often
rationality is the rule of the day, and our lot in life is
mostly that of a satisficer.
Optimal can be the enemy of good, and the tradeoffs we've made
here certainly prove this out.
When I was doing QA for a living people are surprised when I tell
them the most important book for testers to read is Administrative
Behavior. This is because you have to understand the
constraints of your environment do do your job well, which is to
provide actionable information to decision-makers. I'm
beginning to realize this actually suffuses the entire development
process from top to bottom.