File Encoding and Detection
Samuel Ashmore
CS489-02: Digital Forensics September 19, 2006
Introduction
One of the most important aspects of understanding how a file is interpreted is the encoding of the file. The encoding affects the way in which the file is understood, and possibly how it relates to other files. Encoding a file means translating it from one format to another. One of the main ways this affects an investigation is that the encoding specifies how the raw binary content of the file is turned into content that can be displayed to the user. There are four main categories of encoding: Character, Text, Semantic, and Medium Optimization. Each of these affects how much information a file holds for an examiner and how it will steer an investigation. 1
Character Encoding
One of the most basic and longest-standing encodings in the computer world is character encoding. This encoding dictates how individual characters in a simple text file are stored. The data usually does not record which encoding was used, so it is left to the end user to recover it. There are three main methods for determining the character encoding of a file.
1 Wikipedia contributors, "Encoding," Wikipedia, The Free Encyclopedia.
The first is the code scheme method, where each character encoding is treated as a scheme that defines a set of legal codes, independent of what those codes ultimately mean. The text is analyzed for codes that do not belong to the guessed character encoding; finding one rules that encoding out. If a particular code that exists in no other code scheme is found, a match is generated. If the result is still inconclusive at the end of this, the data is fed into another test. 2
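The elimination step of the code scheme method can be sketched roughly as follows. This is only an illustration, assuming Python's built-in codecs as the code schemes; the function name `plausible_encodings` and the candidate list are hypothetical and do not come from the cited work.

```python
def plausible_encodings(data: bytes,
                        candidates=("ascii", "utf-8", "shift_jis", "euc_jp")):
    """Return the candidate encodings under which every code in data is legal."""
    survivors = []
    for name in candidates:
        try:
            # Strict decoding raises as soon as an illegal code is met.
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this scheme is ruled out
        survivors.append(name)
    return survivors


sample = "encoding: café".encode("utf-8")
remaining = plausible_encodings(sample)
print(remaining)
```

Several schemes may survive this step, which is the inconclusive case in which the data would be passed on to one of the other tests.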
The next is character distribution. This method relies on the fact that in any given language certain letters are more frequent than others. It compares the distribution of characters in the file against known distributions. This method runs fairly quickly; however, it can become confused when single-byte encodings are used instead of multi-byte encodings. 3
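A comparison of this kind might look like the sketch below. It assumes the text has already been decoded under a guessed encoding, and it uses cosine similarity as the comparison measure, which is a choice made here for illustration rather than the measure used in the cited work. The profiles in `TOY_PROFILES` are invented placeholders, not measured language statistics.

```python
import math
from collections import Counter

# Invented placeholder profiles; real profiles would be measured from large corpora.
TOY_PROFILES = {
    "a-heavy": {"a": 0.9, "b": 0.1},
    "b-heavy": {"a": 0.1, "b": 0.9},
}


def letter_distribution(text: str) -> dict:
    """Relative frequency of each letter in text (empty dict if no letters)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def cosine_similarity(p: dict, q: dict) -> float:
    """Cosine of the angle between two frequency vectors (0.0 if either is empty)."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def best_match(text: str, profiles: dict = TOY_PROFILES) -> str:
    """Name of the profile whose distribution is closest to that of text."""
    observed = letter_distribution(text)
    return max(profiles, key=lambda name: cosine_similarity(observed, profiles[name]))
```

For example, `best_match("aaaab")` selects the "a-heavy" profile, because four of the five letters are "a".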
The third method is two-character sequence distribution. This method is an extension of the previous one and uses the patterns of adjacent characters in a language to derive the character encoding. The resulting distributions of letter pairs are extremely encoding specific and allow for better prediction of the encoding. This method is the most effective for single-byte encodings, although it is not as good as character distribution for multi-byte encodings. One of its major drawbacks is that it only applies to average text: anything weighted towards a specific set of characters that is not typical of that average can cause it to interpret the text incorrectly. 4
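Counting letter pairs is simple to sketch. The functions below are hypothetical helpers, and the set of expected pairs passed to `bigram_score` is an assumption that would in practice come from a language model built for each candidate encoding.

```python
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count adjacent letter pairs, skipping pairs that include a non-letter."""
    text = text.lower()
    return Counter(a + b for a, b in zip(text, text[1:])
                   if a.isalpha() and b.isalpha())


def bigram_score(text: str, expected_pairs: set) -> float:
    """Fraction of the observed letter pairs that belong to expected_pairs."""
    counts = bigram_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for pair, n in counts.items() if pair in expected_pairs)
    return hits / total


# "the cat" contains the pairs th, he, ca, at; two of the four are expected.
print(bigram_score("the cat", {"th", "he"}))  # 0.5
```

A decoding that yields a high score under one language's expected pairs and low scores under the others is a stronger hint than single-letter frequencies alone.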
2 Duerst, Martin. 1997. The Properties and Promises of UTF-8. 11th Unicode Conference.
3 Duerst, Martin. 1997.
4 Duerst, Martin. 1997.
Text Encoding
Text encoding is one of the most widely used and most content-rich encodings of data. It extends the classic text file to include formatting data, in addition to other data that helps to elaborate on the content. The three types of text encoding are presentational, procedural, and descriptive. 5
Presentational encoding is the most basic of these three. This type of text encoding deals with expressing the structure of a document, such as titles and headers. It does not express anything about the formatting of the text itself, such as adding emphasis to characters.
Procedural encoding is the most widely recognized of the three text encoding methods. This method expresses ideas such as color and different sizes for text. The formatting that is added to the document is usually visible to the user while the user is creating the file, although this is not strictly necessary. This encoding includes file types such as PostScript.
Descriptive encoding is the final kind of text encoding. This encoding stores data about elements. It does not necessarily deal with how the data is displayed, but rather helps to relate pieces of data to each other. An example of this type of encoding is XML, where data is stored using tags that show how each piece of data relates to others in its category.
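As a small illustration of descriptive encoding, the sketch below reads a made-up XML fragment with Python's standard `xml.etree.ElementTree` module. The element names, attribute values, and file names are all hypothetical; the point is only that the tags describe what each item is and how it relates to its siblings, not how it should be displayed.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags describe the data, not its appearance.
doc = """<case id="hypothetical-001">
  <evidence type="image"><name>photo.bmp</name></evidence>
  <evidence type="image"><name>photo.jpg</name></evidence>
</case>"""

root = ET.fromstring(doc)

# Every <evidence> element is related to the others through its parent <case>.
names = [item.findtext("name") for item in root.findall("evidence")]
print(root.get("id"), names)
```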
Data can be formatted in multiple text encoding categories. One example of this is HTML, which stores data about the color and size of text while also storing data about how the document links to other documents. The main purpose of a text encoding scheme is to expand on the information presented by the character encoding. Usually a text encoding will store information about the character format it uses as part of its header information.
5 Coombs, James H. et al. 1987. Markup systems and the future of scholarly text processing.
Semantic Encoding
Semantic encoding is where data is transferred from one format to another without the loss of data, so that the change is reversible. This is important since not all computers and users can understand data in every format. Because of this, the data must be converted or interpreted into a new format. This change of format occurs in activities such as printing a document or compiling code. 6
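The defining property here, that nothing is lost and the change can be undone, can be shown with a round trip. Base64 is used below purely as a convenient stand-in for a reversible format change; it is not one of the examples given in the text, and the function names are hypothetical.

```python
import base64


def encode_for_transport(data: bytes) -> bytes:
    """Re-express arbitrary bytes using only printable ASCII characters."""
    return base64.b64encode(data)


def decode_from_transport(encoded: bytes) -> bytes:
    """Undo encode_for_transport and recover the original bytes."""
    return base64.b64decode(encoded)


original = bytes(range(256))  # every possible byte value once
round_trip = decode_from_transport(encode_for_transport(original))
print(round_trip == original)  # True: the change of format lost nothing
```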
Medium Optimization Encoding
Medium optimization encoding is similar to semantic encoding, except that the encoding does not need to be reversible and usually loses some data. The purpose behind this encoding is that, in order to fit data through a pipe, less important data can often be discarded to make better use of resources. An example of this is JPEG compression. It allows the data from a bitmap to be reduced in size, at the cost of a significant loss of data that is nevertheless not visible to the user. Because of this loss of data, the picture can be stored and transmitted much more easily in a resource-limited environment.
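The irreversibility can be illustrated with a deliberately crude sketch. JPEG itself relies on a much more elaborate transform-and-quantize scheme; the function below simply discards the low bits of each byte, which is an assumption made only to show that distinct inputs collapse to the same output and so cannot be told apart afterwards.

```python
def quantize(samples: bytes, keep_bits: int = 4) -> bytes:
    """Keep only the top keep_bits bits of every byte and zero the rest."""
    mask = (0xFF << (8 - keep_bits)) & 0xFF
    return bytes(b & mask for b in samples)


pixels = bytes([200, 201, 202, 203, 17, 18])
reduced = quantize(pixels)
print(list(reduced))  # [192, 192, 192, 192, 16, 16]
```

Six distinct input values become two, so the output would compress far better, but the original values 200 through 203 can no longer be recovered from 192.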
6 Li, Shanjian. 2002. A composite approach to language/encoding detection. 19th International Unicode Conference.
Effect of File Encoding on Forensic Investigations
File encoding can affect how the forensic examiner interprets data. An examiner who does not understand the format may be unable to use a piece of data, while an examiner who does understand it can use the data either as evidence or as a key to more evidence. File encoding can also help to relate data already found in a case to other pieces of data, and help to draw a picture or timeline for the investigator. For example, suppose the examiner finds a bitmap picture on the hard drive and another picture with the same content that is a JPEG. If the least significant bits of the data differ, the examiner can start to determine whether the bitmap is an original or whether the suspect has hidden data in it.
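A least-significant-bit comparison could be sketched as follows, assuming both pictures have already been decoded into raw pixel bytes of the same layout; the decoding step and both function names are hypothetical and are left out of the example. Because lossy compression itself alters low-order bits, a mismatch on its own is a lead for further analysis rather than proof of hidden data.

```python
def lsb_plane(pixels: bytes) -> list:
    """Least significant bit of every pixel byte."""
    return [b & 1 for b in pixels]


def lsb_mismatch_ratio(a: bytes, b: bytes) -> float:
    """Fraction of positions at which the two buffers differ in their lowest bit."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # zip stops at the shorter buffer, so exactly n positions are compared.
    differing = sum((x ^ y) & 1 for x, y in zip(a, b))
    return differing / n


bitmap_pixels = bytes([2, 3, 4, 5])
jpeg_pixels = bytes([2, 2, 4, 4])
print(lsb_mismatch_ratio(bitmap_pixels, jpeg_pixels))  # 0.5
```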
File encoding also affects the prosecutors and the jury. If the jury does not understand how the encoding of the file was determined, or how data was extracted from this encoding, important evidence can be dismissed. This can even show the prosecutor in a bad light, undermining any evidence gained by the investigation.
Future Developments of File Encoding
The file encoding field is working towards the goals of more optimized data storage and files containing more and more descriptive data. This will affect future law enforcement investigations by requiring a more diverse understanding of file encoding, but at the same time will present the examiner with more potential leads in a case. It may also increase the amount of data the examiner has to process, since medium-optimized data can potentially store many more pictures and videos in a smaller space.
Bibliography
Coombs, James H. et al. 1987. Markup systems and the future of scholarly text processing. Communications of the ACM. http://xml.coverpages.org/coombs.html
Duerst, Martin. 1997. The Properties and Promises of UTF-8. 11th Unicode Conference. http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/IUC11-UTF-8.pdf
Li, Shanjian. 2002. A composite approach to language/encoding detection. 19th International Unicode Conference. http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Wikipedia contributors, "Encoding," Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Encoding&oldid=62581088 (accessed September 19, 2006).