PlaidCTF 2015 Writeup: RE GEX

I participated in PlaidCTF 2015 with WPI’s team. It was a very challenging CTF. In fact, we only managed to solve one real challenge. Since most of the effort on this problem was mine, I decided to do a writeup.

The challenge provided two files: a python server program and a crazy-huge regex (see regex.txt). The python server is very basic: it simply listens for a connection, prompts for a ‘key’, and then checks whether the key matches the regex. It will then send the flag, but only if the provided key did NOT match the regex.

After the initial shock of looking at many screenfulls of regex, it actually breaks down into simple chunks. The regex consists of a single group which must match the entire string (^(...)$), containing a number of alternatives. Because we need to find a string which does not match the regex, we need to ensure that none of the alternatives match our string. The first alternative (.*[^plaidctf].*) will match any string which contains any character not in ‘plaidctf’, so we know our solution will contain only those characters. The second two alternatives (.{,170}|.{172,}) will match any string which is 170 characters or shorter and any string which is 172 characters or longer, respectively, so we know that our solution must be 171 characters long.

Each of the following alternatives matches a 171-character string where certain specified characters are in certain sets. For instance, the first of these (.{88}[padt].{60}[licf].{6}[plai].{14}) matches any 171-character string where the 89th character is in ‘padt’, the 149th character is in ‘licf’, and the 165th character is in ‘plai’. Because we need our solution to not match any alternative, we therefore know that either the 89th character of the solution is not in ‘padt’ OR the 149th character is not in ‘licf’ OR the 165th character is not in ‘plai’.

The problem of finding an input which satisfies all of these constraints reduces easily to SAT, a well-studied NP-hard problem for which off-the-shelf solvers are plentiful. SAT is the problem of finding values for all of the variables in a boolean formula so that the value of the formula is true. To reduce our problem to SAT we need to put it all into boolean form.

One way to do this is to create 8 variables for each character in the solution string (8 * 171 = 1368 variables for the entire problem), each representing that character being one of the characters in ‘plaidctf’. So for instance, the variable A might represent the statement “the character at index 0 in the solution is ‘p’”, B might represent the statement “the character at index 0 in the solution is ‘l’”, and so on. For each set of 8 variables (concerning a single character), it would be necessary to add the constraint that exactly one of those variables can be true. (An example of using a similar technique to apply a SAT solver to sudoku problems can be found here.)

I wrote a python script which parses the regex, generates boolean constraints using this technique, puts the constraints into CNF, and then attempts to solve the problem with a SAT-solving engine. However, after leaving this running overnight and getting no results, it began to seem unlikely that this was the right solution. How can we make it faster?

Well, we’re using 8 boolean values to represent what is actually 3 bits of data (there are 8 possible characters at each position, so it should only take 3 boolean variables to represent that). Suppose instead of using 8 values for each character, we use three values, corresponding to the bits of the index of that character in the string ‘plaidctf’?

p 000
l 001
a 010
i 011
d 100
c 101
t 110
f 111

So for each character we have three boolean variables. If they are all false that corresponds to the character being ‘p’. If the first two are false and the third is true, the character is ‘l’, and so on.

At this point I noticed an interesting pattern in the constraints. Each time there is a set of characters, it is almost always four characters. The four characters always appear in order of their indices in ‘plaidctf’, and there seems to be some repetition of character sets. In fact, there are only six four-character sets in all, and the characters in each set all share a common bit-value:

plai (first bit 0)
dctf (first bit 1)
pldc (second bit 0)
aitf (second bit 1)
padt (third bit 0)
licf (third bit 1)

So each of these character sets actually only depends on a single bit in the character at that position (a single variable in our boolean system)!

There are also some character sets which are only two characters long (for instance ‘dt’). In these cases the set provides information on two bits. The set ‘dt’ for instance, can be easily represented logically as bit 1 AND NOT bit 3.

(Note, we need the regex to NOT match, so the character set ‘plai’ appearing in the regex actually means that the first bit of that character in our solution must be 1 (for instance). In the example two-character set ‘dt’, the inverse expression becomes NOT bit 1 OR bit 3.)

Because each character set now translates to a clause which depends only on one or two variables, and the set of variables is significantly smaller, the SAT problem generated by this method is much simpler than the previous one. The script (solution.py) took under 30 minutes to produce the string:

cddliadtatdddcfidpfatdaccafddiadpltdicdfldcltiftpdafpaddfdcddipappfdptapiptpatipccllpttpcitpdpdtapptfcppfdftccfiapctdallcitaadlfiatfpfpdiidltpacipdctcapfiddftcpalppidlpilp

Connecting to the server at the address provided in the challenge and entering that string yielded the flag (flag{np_hard_reg3x_ftw!!!1_ftdtatililactldtadf}).

solution.py

This is the script I used to crack the regex.

# !/usr/bin/env python2
import multiprocessing

from pycryptosat import Solver  # must have cryptominisat4 installed and run in python2

# get the segments of the regex, ignoring the first three (which limit the character set and length)
regex_segments = open('regex.txt', 'r').read().strip()[34:][:-2].split('|')

# Iterate through the regex segments and generate clauses. Each constraint is a list of tuples (index charset), where
# index is an index into the solution string and charset is the set of chars which the regex would match at that index
constraints = []
for segment in regex_segments:
    index = 0
    iterator = iter(segment)
    constraint = []
    try:
        while True:
            char = next(iterator)
            if char == '.':
                # the character at the current index can be anything, skip it
                index += 1
            elif char == '{':
                # parse a number of repetitions for '.'
                num_str = ''
                char = next(iterator)
                while char != '}':
                    num_str = num_str + char
                    char = next(iterator)
                num = int(num_str)
                index += num - 1  # - 1 for the . character, for which we already incremented the index
            elif char == '[':
                # parse the charset and add it to the constraint
                charset = ''
                char = next(iterator)
                while char != ']':
                    charset += char
                    char = next(iterator)
                constraint.append((index, charset))
                index += 1
            else:
                raise Exception("Unexpected character encountered in regex")
    except StopIteration:
        constraints.append(constraint)


def variables(index, charset):
    """
    Given an index into the solution string and a set of characters from the regex, returns the number of each boolean
    variable which that charset would imply (positive if it's implied to be 1, negative for 0).
    """
    ones = 0b111
    zeros = 0b111
    for char in charset:
        char_index = 'plaidctf'.index(char)
        ones &= char_index
        zeros &= ~char_index
    for bit_index in range(3):
        var_number = (index * 3) + bit_index + 1
        if (ones >> bit_index) & 0b001:
            yield -var_number  # Negative for ones because the character sets must not be matched
        elif (zeros >> bit_index) & 0b001:
            yield var_number


# Now we convert the constraints into CNF so they can be put into cryptominisat
cnf = []

# create constraints for regex segments
for constraint in constraints:
    row = set()
    for index, charset in constraint:
        for variable in variables(index, charset):
            row.add(variable)

    cnf.append(list(row))

s = Solver(threads=multiprocessing.cpu_count())
for line in cnf:
    s.add_clause(line)
sat, solution = s.solve()
if sat:
    # Convert the solution from a list of booleans to a string
    solution = [solution[i + 1] if i + 1 < len(solution) else False for i in
                range(171 * 3)]  # pad solution out to expected number of variables
    s = ''
    for i in range(171):
        b1, b2, b3 = solution[i * 3], solution[i * 3 + 1], solution[i * 3 + 2]
        ci = 0
        if b1:
            ci += 1
        if b2:
            ci += 2
        if b3:
            ci += 4
        s += 'plaidctf'[ci]
    print(s)
else:
    print("Unsatisfiable")

regex.txt

This file is too huge to reproduce here, so you can see it here instead.

Written on April 20, 2015