Any comments and criticisms are welcome.
An approach from compressed sensing seems to provide a range from $69.96$ bits to $171.72$ bits:
1.) Storing the puzzle implies storing the solution (information-theoretically).
2.) The hardest sudoku puzzles seem to have $t(\alpha)\alpha^2$ given entries for some $t(\alpha)$ that depends on $\alpha$ (for example, $t(3) = 2.44444$ to $3$, i.e., $22$ to $27$ givens): http://www.usatoday.com/news/offbeat/2006-11-06-sudoku_x.htm
Hence, we have a vector $P$ of length $\alpha^4$ that has at most $t(\alpha)\alpha^2$ non-zero entries.
3.) Take $M$, a $\beta\times\alpha^4$ matrix with $\beta \geq 2t(\alpha)\alpha^2$, any $2t(\alpha)\alpha^2$ of whose columns are linearly independent, and with entries in $\{0,\pm 1\}$. This matrix is fixed for all instances of the puzzle. By the UUP (uniform uncertainty principle), $\beta = kt(\alpha)\alpha^2$ for some fixed $k$ suffices.
4.) Compute $V = MP$. This gives $\beta$ integers whose magnitudes are on average bounded by $\alpha^2$, since the entries of $M$ are random with values in $\{0,\pm 1\}$.
5.) Storing $V$ needs $\beta\log\alpha^2 = 2kt(\alpha)\alpha^2\log\alpha$ bits.
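To make steps 2.)–5.) concrete, here is a minimal Python sketch of the encoding side only. The example puzzle is arbitrary, and a randomly drawn $\{0,\pm 1\}$ matrix is simply assumed (not verified) to satisfy the column-independence condition of step 3.):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 3                      # 9x9 sudoku: an alpha^2 x alpha^2 grid, alpha^4 cells
n = alpha ** 4                 # length of the puzzle vector P (81)

# P: the puzzle as a sparse vector -- given cells hold their digit 1..alpha^2,
# empty cells are 0 (an arbitrary example puzzle, row-major, '0' = empty).
puzzle = ("530070000600195000098000060800060003"
          "400803001700020006060000280000419005000080079")
P = np.array([int(c) for c in puzzle])

s = np.count_nonzero(P)        # number of givens; plays the role of t(alpha)*alpha^2
k = 2
beta = k * s                   # number of measurements, beta = k*t(alpha)*alpha^2

# M: a fixed beta x alpha^4 matrix with entries in {0, +1, -1}.  A random such
# matrix is merely assumed here (not checked) to have any 2s columns linearly
# independent, i.e. to satisfy the UUP-style condition of step 3.).
M = rng.choice([-1, 0, 1], size=(beta, n))

# V = M P: beta small integers; this is all that would be stored.
V = M @ P
print(len(V), V[:8])

# Decoding: recover P as the sparsest vector consistent with V = M P
# (e.g. by L1 minimization); that step is omitted here.
```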
In your case, $\alpha = 3$ and $t(\alpha) = 2.44444$ to $3$, so $2kt(\alpha)\alpha^2\log\alpha = 69.96k$ bits to $85.86k$ bits. Taking $k = 2$, the minimum required, gives roughly $139.92$ bits to $171.72$ bits as a lower bound for the average case.
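For reference, these figures correspond to taking $\log_2\alpha^2 = \log_2 9 \approx 3.18$ bits per stored measurement, with $22$ to $27$ givens:

```python
from math import log2

alpha = 3
print(log2(alpha ** 2))                    # exact value: log2(9) ~ 3.1699

for k in (1, 2):
    for givens in (22, 27):                # t(3) * alpha^2 = 22 to 27
        beta = k * givens                  # beta = k * t(alpha) * alpha^2
        print(f"k={k}, givens={givens}: {beta * 3.18:.2f} bits")
# k=1: 69.96 and 85.86 bits; k=2: 139.92 and 171.72 bits, as quoted above
```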
Note that I have hand-waved over some assumptions, such as the sizes of the entries of $MP$ and the number of given entries one has on average in the puzzle.
A.) Of course, it might be possible to reduce $k$ below $2$, since in sudoku the positions of the sparse entries are not mutually independent. Each entry has on average $t(\alpha)-1$ other entries in each of its row, column, and sub-box. That is, given that some entries are present in a sub-box, column, or row, one can find the odds of entries being present in the same row, column, or sub-box.
B.) Each row, column, or sub-box is assumed to have on average $t(\alpha)$ non-zero entries with no repeated symbol. This means certain vectors with $t(\alpha)$ non-zero entries (e.g., those repeating a symbol) will never occur, thereby reducing the search space of solutions. This could also reduce $k$. For instance, fixing $t(\alpha)$ entries each in a sub-box, a row, and a column would reduce the search space from $\binom{\alpha^4}{t(\alpha)\alpha^2}$ to $\binom{\alpha^4-(3\alpha^2-1)}{t(\alpha)\alpha^2-3t(\alpha)}$, as in the sketch below.
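As a rough illustration of the size of this reduction, assuming $t(\alpha)=3$ and counting only the positions of the givens (ignoring their values and the no-repetition structure):

```python
from math import comb, log2

alpha, t = 3, 3
n = alpha ** 4                                       # 81 cells
s = t * alpha ** 2                                   # 27 given entries

full = comb(n, s)                                    # C(alpha^4, t(alpha)*alpha^2)
reduced = comb(n - (3 * alpha ** 2 - 1), s - 3 * t)  # after fixing t entries each
                                                     # in a row, column and sub-box
print(f"{log2(full):.1f} -> {log2(reduced):.1f} bits to index the positions")
```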
A comment: maybe a multi-user, arbitrarily correlated Slepian–Wolf model would help make the entries independent while still respecting the criterion of at most $t(\alpha)\alpha^2$ non-zero entries. However, if one could use it, one need not have gone through the compressed sensing route in the first place. So applying Slepian–Wolf might be hard.
C.) From an error-correction analogy, an even more significant reduction may be possible, since in higher dimensions there could be gaps between the Hamming balls of half-the-minimum-distance radius around codewords, with the possibility of correcting more errors. This too should reduce $k$.
D.) $V$ itself can be entropy-compressed. If the entries of $V$ are quite similar in size, can we assume that the difference between any two of them is at most $O(\sqrt{V_{\max}}) = O(\sqrt{\alpha^2})$? Then, if encoding the differences between the entries suffices, this alone would remove the factor of $2$ in $\beta\log\alpha^2 = 2kt(\alpha)\alpha^2\log\alpha$ (see the sketch below).
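A toy sketch of this delta-encoding idea, under the unverified assumption that the entries of $V$ are close in size (the example values are made up):

```python
import numpy as np

def delta_encode(V):
    # Store the first measurement in full and then only successive differences;
    # if any two entries differ by at most O(sqrt(V_max)), each difference needs
    # about half the bits of a full entry, removing the factor 2 in beta*log(alpha^2).
    V = np.asarray(V)
    return V[0], np.diff(V)

def delta_decode(first, diffs):
    return np.concatenate(([first], first + np.cumsum(diffs)))

V = np.array([37, 41, 35, 40, 38, 36])    # hypothetical measurement vector
first, diffs = delta_encode(V)
assert np.array_equal(delta_decode(first, diffs), V)
```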
It would be interesting to see whether $2k$ can be brought down to $2$ or below using A.), B.), C.), and D.). This would be better than $89$ bits (the best so far among the other answers) and, in the best case, better than the absolute minimum over all puzzles, which is around $73$ bits.