ok, for a little explanation.
i thought i had generated all 256 characters per OEM code page and mapped out the resulting LM hashes. But my mistake was that i generated the first 256 Unicode characters instead. Although this is a useful character set, OEM code pages also contain characters from higher Unicode code points.
So i started over, taking in short these steps:
* Per code page, get a list of bytes and their respective unicode bytes (2 bytes). These are available for example from the following URL:
http://www.microsoft.com/globaldev/reference/oem/850.mspx
* Switch the default OEM code page in Windows and generate LM hashes for all 256 characters of that OEM code page
* Collect all the hashes, take the unique LM hashes and check which bytes crack these hashes. These bytes will form the final charset that rcrack will need to crack the LM hashes.
* Generate all LM hashes for all 65536 unicode characters. I did this for cp850, and found that all characters indeed map to one of the characters in the code page (if they map at all). It is however important to get a full list of all the mapped characters, to be able to quickly bruteforce the matching NTLM hash.
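To make the first and last step a bit more concrete, here is a minimal Python sketch. It is not the procedure i actually used (that went through Windows and real LM hashes); it assumes Python's built-in code page codecs as a stand-in for the Microsoft tables, and the names `byte_table` and `reverse_table` are made up for this example. It only finds exact mappings, so any extra characters that Windows maps onto a code page byte will not show up here.

```python
def byte_table(codepage):
    """Map each single byte 0x00-0xFF to its Unicode character, where defined."""
    table = {}
    for b in range(256):
        try:
            table[b] = bytes([b]).decode(codepage)
        except UnicodeDecodeError:
            pass  # undefined byte, or a lead byte in a multi-byte code page
    return table


def reverse_table(codepage):
    """Scan all 65536 code points and keep those that encode to one byte."""
    mapped = {}
    for code_point in range(0x10000):
        try:
            encoded = chr(code_point).encode(codepage)
        except UnicodeEncodeError:
            continue  # this character does not map in this code page
        if len(encoded) == 1:
            mapped[chr(code_point)] = encoded[0]
    return mapped


if __name__ == "__main__":
    table = byte_table("cp850")
    print(hex(ord(table[0x82])))              # Unicode value behind byte 0x82
    print(hex(reverse_table("cp850")["é"]))   # and back again
```

The reverse scan is the part that matters for bruteforcing the NTLM hash afterwards, because it lists every Unicode character that can hide behind one cracked byte.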
So i did this for the following code pages (of these i consider 437 (US) and 850 (European) the most valuable for now):
• 437 (US)
• 720 (Arabic)
• 737 (Greek)
• 775 (Baltic)
• 850 (Multilingual Latin I)
• 852 (Latin II)
• 855 (Cyrillic)
• 857 (Turkish)
• 862 (Hebrew)
• 866 (Russian)
• 874 (Thai)
• 932 (Japanese Shift-JIS)
• 1258 (Vietnam)
In total, all the code pages together make up 230 unique LM hashes. This makes sense, as there are 256 possible input bytes, and in all the OEM code pages lowercase alpha gets converted to uppercase alpha, so you have max 256-26=230 possible input bytes. It is funny to see that you can have different LM hashes for equal characters in different OEM code pages.

That won't get you into trouble that easily, because nowadays most systems use NTLM for actual password verification anyway.
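The reason for the different hashes is that the LM hash is computed over the OEM bytes of the uppercased password, and the same character can sit at a different byte in another code page. A small Python sketch of that, assuming Python's `cp437` and `cp850` codecs match the Windows tables for this character; `oem_bytes` is a made-up helper name, and Python's `upper()` is only an approximation of the uppercasing Windows does:

```python
def oem_bytes(password, codepage):
    """Uppercase the password and encode it to the given OEM code page.

    This is only the input to the LM hash, not the hash itself.
    """
    return password.upper().encode(codepage)


if __name__ == "__main__":
    for codepage in ("cp437", "cp850"):
        # The cent sign is byte 0x9b in cp437 but 0xbd in cp850.
        print(codepage, oem_bytes("¢", codepage).hex())
```

Different input bytes go into the DES step, so the LM hash differs while the NTLM hash (computed over Unicode) stays the same.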
So this all seems nice, but 230 bytes at lengths 1-7 is for now FAR too much for us to generate rainbow tables for. We can skip the first 32 characters (the control characters, including 0x0000, no password) for every code page, as these are not realistic to be used in passwords. So then we are left with a character set of 198 bytes, which is still far too large. It might however be interesting to generate lengths 1-6, as this might give quite some success on cracking the second LM hash (and the first, for really short passwords).
So what we could do to get useful character sets is split them up per code page. But now that i have mapped out all the characters per code page, there are still too many unique LM hashes per code page. For example (without the first 32):
* cp437: 171
* cp850: 167
Combined, these make up 173 unique bytes. If we also drop the block and line characters, which are just as unrealistic to be used in a password, we leave out characters from these Unicode blocks:
* Box Drawing (U+2500 - U+257F)
* Block Elements (U+2580 - U+259F)
* Geometric Shapes (U+25A0 - U+25FF)
Then we have 123 bytes for cp437 (which contains many box/block characters) and 139 bytes for cp850. This is probably still too much, although scoobz might shine his calculative light upon this.
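The block filter can be sketched like this in Python. Again this assumes Python's code page codecs instead of the Microsoft tables, and `candidate_bytes` is a made-up name. It works on raw bytes and does not fold lowercase to uppercase, so its counts will not be the same as the unique-LM-hash counts above.

```python
# Unicode blocks considered unrealistic for passwords (inclusive ranges).
EXCLUDED_BLOCKS = [
    (0x2500, 0x257F),  # Box Drawing
    (0x2580, 0x259F),  # Block Elements
    (0x25A0, 0x25FF),  # Geometric Shapes
]


def in_excluded_block(ch):
    """True if the character falls inside one of the excluded blocks."""
    return any(low <= ord(ch) <= high for low, high in EXCLUDED_BLOCKS)


def candidate_bytes(codepage):
    """Bytes 0x20-0xFF whose character is outside the excluded blocks."""
    keep = []
    for b in range(0x20, 0x100):  # skip the first 32 control characters
        try:
            ch = bytes([b]).decode(codepage)
        except UnicodeDecodeError:
            continue  # undefined in this code page
        if not in_excluded_block(ch):
            keep.append(b)
    return keep
```

For example, byte 0xDB (full block) is dropped for cp437, while 0x82 (é) is kept.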

I calculate a keyspace of around 2^48.6 (might be possible?) for cp437 and 2^49.8 for cp850 (too much i guess).
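For anyone who wants to check the numbers: a quick Python sketch, assuming the keyspace is simply the sum of charset_size^n over lengths 1 to 7 (`keyspace` is a made-up helper name).

```python
import math


def keyspace(charset_size, max_len=7):
    """Number of candidate strings of length 1..max_len over the charset."""
    return sum(charset_size ** n for n in range(1, max_len + 1))


if __name__ == "__main__":
    for name, size in (("cp437", 123), ("cp850", 139)):
        print(name, round(math.log2(keyspace(size)), 1))
```

The same function gives about 2^53.4 for the 198-byte set at lengths 1-7 and about 2^45.8 at lengths 1-6.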
We could strip away some more characters that are not likely to be used in passwords, so that we end up with just those characters that people use in real text, like characters with accents and such:

* We could at least skip 0x007F (DELETE) and 0x00A0 (No-Break Space).
* We could skip mathematical and technical characters like ⌠⌡≈°∙√ⁿ, for example from the Unicode blocks Mathematical Operators (U+2200 - U+22FF) and Miscellaneous Technical (U+2300 - U+23FF)
* We could manually pick 'useless' characters, i picked these:
cp437: ªº¿¬½¼¡«»±÷·²
cp850: ×ªº¿®¬½¼¡«»©¦¯±‗¾¶§÷¸°¨·¹³²
* We could skip some currency symbols, but i think we're small enough now (would save a couple more).
So now i'm down to 96 bytes for cp437 and 108 for cp850. Combined they take up 114 bytes.
I would especially like some comments on the characters i manually removed, feel free to comment.
The numbers here could change a little, because some characters that are not in the specific OEM code page still map to characters in it. So i need to verify those 65536 for all code pages and see if any other useful characters are removed because of this selection. Edit: I just did this for cp850, no relevant characters were removed.
p.s. to see the above special characters correctly, you might need to set your browser to unicode/UTF-8

edit2: see picture to know for sure
