TELEPHONE:

EMAIL:

Information

Data Validation Using The Md5 Hash

by Brian Deering

Most law enforcement computer forensic specialists rely upon mathematical validation to verify that the restored mirror image of a computer disk drive and relevant files exactly match the contents of the original computer. Such comparisons help resolve questions that might be raised during litigation about the accuracy of the restored mirror image. They also act as a means of protection for the computer forensic specialist concerning allegations that files were altered or planted by law enforcement officials during the processing of the computer evidence. In the past 32 bit algorithms were used for this purpose and programs such as CRCHECK and CRC32 became popular. More recently it has become necessary to use more accurate mathematical calculations for this purpose, i.e. 128 bit hashes. The reasons are tied to the potential for brute force attacks using todays powerful desktop computers and also the volume of files that exist on contemporary computer hard disk drives. It is not uncommon to find over 100,000 files stored on computer hard disk drives today and the storage capacity seems to increase every few months due to advances in technology. NTI has adopted the use of the RSA MD5 algorithm for this purpose and it has been incorporated in NTI's FileList Pro, CrcMD5 and DiskSig Pro programs. FileList Pro and DiskSig Pro also provide the capability to create more accurate hashes using the U. S. NIST tested SHA algorithm.

Questions have recently been raised regarding the accuracy of 128 bit hashes and the need for such programs in computer processing. One such question was recently raised in a law enforcement Internet list server group by Fred Kerr. Fred is a retired U. S. Air Force, OSI, Special Agent who has a background in forensics and computer evidence processing. Brian Deering provided the following response which provides good food for thought regarding the issue. Brian's comments follow:

This is a toughie but one that intrigues me, for obvious reasons.

Unlike calculating the simple probability of a random event, say the flip of a coin where the input (the coin) and the possible outcomes (actually three, if you consider that it could land on the edge) are a pretty well defined the calculation of the MD5 hash value is NOT a random process. Therefore, since the possible inputs are infinite and the possible outputs are finite and the actual output is formulaically derived and not a random event simple statistical probabilities will probably fail.

Consider the following, however. The MD5 creates a numeric representation of the contents of a message and displays it as a 16 character hexadecimal value. Theoretically, there are 2^128 possible MD5 hash values. Therefore, if you have a file system with 2^128+1 different files on it I can guarantee you that there are at least two different files that will generate duplicate hash values.

Consider the reality of 2^128 possibilities. There are actually 3.402 x 10^38 or 340 billion billion billion billion or a little more than 1/3 of a googol possibilities. When you consider that most people have never seen a million of anything the actual number becomes really difficult to conceptualize. (Leave it to my nine year old to teach me that a googol, which really exists, is actually 10^100.)

Consider the statement in a 1992 MIT document http://theory.lcs.mit.edu/~rivest/Rivest-MD5.txt that says, "It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given pre-specified target message digest." Note the wording carefully, "computationally infeasible", not impossible. Some would argue that ca. 1992, when a 16 MHZ 80486 was a high speed computer, this statement was probably accurate but that today, in the era of 400 MHZ Pentiums, Super Crays and heaven knows what else folks have stuffed into there basement closets, it's no longer INFEASIBLE.

So, in the strictest sense, if you had sufficient computing capability (and I don't know who that would be but I can pretty well bet that no one willing to associate with the likes of you and me has this) then you could conceivably create two different documents, presumably one that inculpates a defendant, with identical hash values. Is this a deficiency of the MD5? Is this a potential problem for courtroom purposes? No, on both questions!!! With any evidence, isn't it possible to tamper with it in such a way as to create a desired output? The obvious response to any questions along this line is, "Yes, it is possible to intervene and effect a possible outcome." and "No, I did not tamper with the evidence."

So, now the more germane question. Could an accidental change occur to data in such a manner that two different files generated identical hash values. The answer is a clear, "Yes." Again, there are a finite number of possible hash values and an infinite number of possible files.

There are actually two interacting probabilities working here. First, what's the probability of any specific document being generated by a random occurrence and second, what's the probability that the resulting document would generate a hash value identical to an already existing file.

First. Examine the probability of a specific accidental change occurring. Say an electrostatic discharge scrambles your data. You finished with a document X characters long, there are 256 different possible outcomes for each character, the possible outcomes for a scrambled document of X characters then is 256^X. The probability then of randomly creating a document that says (without quotes), "I did it." is 256^9 or 1:4,722 X 10^24.

Second. Consider the possibility of a duplicate hash value occurring randomly. I've stated earlier that this is hard to calculate because the MD5 is formulaic rather than random. So let's assume, and there is nothing in the literature anywhere I've found to suggest this, that the MD5 is flawed in some way and doesn't generate all of the 2^128 possible outcomes. Say it only generates even numbers and let's say it doesn't do that very well and only generates half of the possible even numbers. (We're now down to one fourth of the total possible values.) The possible outcomes is now a mere 2^126 or 8.507 x 10^37. Conservatively, the probability of different files generating duplicate hash values is minuscule. Let's go with 1:8.507 x 10^37. I believe that one fourth of the total possible outcomes is a pretty conservative estimate but if you don't think so feel free to use whatever fractional amount you choose, it won't materially change the final outcomes.

NOW. Let's synthesize. Initially it's really unlikely (1:256^X, where X represents the total characters in the document) that any random change will create any specific document. Then, consider the probability that the randomly created document would generate an identical hash value (1:8.507 x 10^37) and you get the following: The probability that a random change could occur that created any specific document of X number of characters and that the newly created document would create a specific hash value is (1:256^X) * (1:8.507 x 10^37). Now, in my advanced years I'll admit that my brain is a little addled but I think that calculates out to 1:2,177 x 10^(37+X).

If you accept the logic to this point, your challenge, should you accept it, is to relate this to someone on a jury. How about this as a comparison. In Pennsylvania the probability of me winning the Super 6 Lotto with any one ticket is 1:39,000,000. Using the 9 character document described above, I will win the Pennsylvania Super 6 Lotto about 5.582 x 10^41 (or 558,205 billion billion billion billion) times before these two events occur. The probabilities become even more astronomical as the document of X characters increases in length.

So, here's the summary. It's all moot anyway. Because I'm only going to win the Lottery once and then retire.

Fred, are you a betting man?

Brian R. Deering is the Computer Forensics Program Manager at the National Drug Intelligence Center, 319 Washington St., 5th floor, Johnstown, PA 15901 (814) 532-4930 and we appreciate him giving permission to publish his comments.