A file format is a particular way that information is encoded for storage in a computer file.
Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice-versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
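As a minimal illustration of this conversion, the sketch below turns a short piece of text into its 0s and 1s and back again. The helper names `to_bits` and `from_bits` are hypothetical and exist only for this example; ASCII is assumed as the encoding.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, most significant bit first."""
    return " ".join(f"{byte:08b}" for byte in data)


def from_bits(bits: str) -> bytes:
    """Inverse of to_bits: parse space-separated 8-bit groups back into bytes."""
    return bytes(int(group, 2) for group in bits.split())


encoded = "Hi".encode("ascii")          # 'H' is 72, 'i' is 105
bits = to_bits(encoded)
print(bits)                             # 01001000 01101001
print(from_bits(bits).decode("ascii"))  # Hi
```

The same bits could equally be read as a 16-bit number or as two pixels; it is the format that tells a program which interpretation is intended.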
File formats are divided into proprietary and open formats.
Generality
Some file formats are designed to store very particular sorts of data: the JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for storage of several different types of data: the GIF format supports storage of both still images and simple animations, and the QuickTime format can act as a container for many different types of multimedia. A text file is simply one that stores any text, in a format such as ASCII or UTF-8, with few if any control characters. Some file formats, such as HTML, or the source code of some particular programming language, are in fact also text files, but adhere to more specific rules which allow them to be used for specific purposes.
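The description of a text file given above can be turned into a rough heuristic. The sketch below assumes UTF-8 (of which ASCII is a subset) and interprets "few if any control characters" strictly, allowing only tab, line feed and carriage return; both the function name `looks_like_text` and that strict threshold are assumptions made for illustration, not part of any standard.

```python
# Control characters that ordinary text files are assumed to contain.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def looks_like_text(data: bytes) -> bool:
    """Heuristic: data decodes as UTF-8 and holds no unexpected control characters."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in text:
        is_control = ord(ch) < 32 or ord(ch) == 127
        if is_control and ch not in ALLOWED_CONTROLS:
            return False
    return True


print(looks_like_text(b"<html>\n<body>Hello</body>\n</html>\n"))  # True
print(looks_like_text(b"GIF89a\x01\x00\x01\x00"))                 # False
```

An HTML file passes the check because it is also a text file; the heuristic says nothing about which more specific rules the text follows.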
Specifications
Many file formats, including some of the most well-known file formats, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular program treats a particular file format correctly. There are, however, two reasons why this is not always the case. First, some file format developers view their specification documents as trade secrets, and therefore do not release them to the public.
Second, some file format developers never spend time writing a separate specification document; rather, the format is defined only implicitly, through the program(s) that manipulate data in the format.
Using file formats without a publicly available specification can be costly. Learning how the format works will require either reverse engineering it from a reference implementation or acquiring the specification document for a fee from the format developers. This second approach is possible only when there is a specification document, and typically requires the signing of a non-disclosure agreement. Both strategies require significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.
Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats require the encoding of data with patented algorithms. For example, using compression with the GIF file format requires the use of a patented algorithm, and although initially the patent owner did not enforce it, they later began collecting fees for use of the algorithm. This has resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format.
However, the patent expired in the US in mid-2003, and worldwide in mid-2004. Algorithms are usually held not to be patentable under current European law, which also includes a provision that members "shall ensure that, wherever the use of a patented technique is needed for a significant purpose such as ensuring conversion of the conventions used in two different computer systems or networks so as to allow communication and exchange of data content between them, such use is not considered to be a patent infringement", which would apparently allow implementation of a patented file system where necessary to allow two different computers to interoperate.[1]
Identifying the type of a file
A method is required to determine the format of a particular file within the filesystem; the format is an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages.
Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.
Filename extension
Main article: Filename extension
One popular method in use by several operating systems, including Mac OS X, CP/M, DOS, VMS, VM/CMS, and Windows, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm), and GIF images by .gif.
In the original FAT filesystem, filenames were limited to an eight-character identifier and a three-character extension, which is known as an 8.3 filename. Many formats thus still use three-character extensions, even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse the operating system and consequently users.
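Extension-based identification amounts to a table lookup on the text after the final period. The following sketch uses Python's standard `os.path.splitext`; the table `EXTENSION_TABLE` and the function `format_from_name` are hypothetical and far smaller than any real registry.

```python
import os.path

# Hypothetical, deliberately tiny table; real systems keep much larger registries.
EXTENSION_TABLE = {
    ".html": "HTML document",
    ".htm": "HTML document",
    ".gif": "GIF image",
    ".txt": "Plain text",
}


def format_from_name(filename: str) -> str:
    """Guess a format purely from the part of the name after the final period."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TABLE.get(extension.lower(), "unknown")


print(format_from_name("index.html"))    # HTML document
print(format_from_name("index.txt"))     # Plain text
print(format_from_name("README"))        # unknown
```

Because only the name is consulted, the content of the file is never examined: the lookup is fast, but it trusts whatever name the file happens to have.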
One artifact of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it—an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly.
This led more recent operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. This separates the user from the complete filename, preventing the accidental changing of a file type, while still allowing expert users to retain the original functionality by enabling the display of file extensions.
A downside of hiding the extension is that it then becomes possible to have what appear to be two or more identical filenames in the same folder. This is especially true when image files are needed in more than one format for different applications. For example, a company logo may be needed both in .tif format (for publishing) and .gif format (for web sites). With the extensions visible, these would appear as the unique filenames "CompanyLogo.tif" and "CompanyLogo.gif". With the extensions hidden, these would both appear to have the identical filename "CompanyLogo", making it more difficult to determine which to select for a particular application.
A further downside is that hiding such information can become a security risk[2]. On a system that relies on filename extensions, every usable file has one (for example, all JPEG images have ".jpg" or ".jpeg" at the end of their name), so users are accustomed to seeing extensions and may depend on them to judge a file's format. With extensions hidden, a malicious user can create an executable program with an innocent name such as "Holiday photo.jpg.exe". The ".exe" is hidden, so the user sees the file as "Holiday photo.jpg", which appears to be a JPEG image that cannot harm the machine, save for bugs in the application used to view it. The operating system, however, still sees the ".exe" extension and runs the program, which is then able to cause harm. To trick users further, an icon can be stored inside the program, as is done on Microsoft Windows; the operating system's icon assignment is then overridden with an icon commonly used to represent JPEG images, so the program looks like an image, and appears to be named like one, until it is opened. Users who have extensions hidden must therefore be vigilant and never open a file that shows a known extension despite the hiding option being enabled, since such a file must have two extensions, the real one remaining unknown until hiding is disabled. This presents a practical problem for Windows systems, where extension hiding is turned on by default.
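The double-extension trick can be checked for mechanically. The sketch below is a simplification: it assumes the shell hides whatever follows the final period, and the two extension sets are illustrative rather than exhaustive. The function names are hypothetical.

```python
import os.path

# Illustrative lists only; not an exhaustive or authoritative registry.
EXECUTABLE_EXTENSIONS = {".exe", ".com", ".bat", ".scr"}
DOCUMENT_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".txt", ".pdf"}


def displayed_name(filename: str) -> str:
    """Name as shown by a shell that hides the final extension."""
    stem, _ = os.path.splitext(filename)
    return stem


def is_disguised_executable(filename: str) -> bool:
    """True if the real extension is executable but the visible one looks harmless."""
    stem, real_extension = os.path.splitext(filename)
    _, shown_extension = os.path.splitext(stem)
    return (real_extension.lower() in EXECUTABLE_EXTENSIONS
            and shown_extension.lower() in DOCUMENT_EXTENSIONS)


print(displayed_name("Holiday photo.jpg.exe"))           # Holiday photo.jpg
print(is_disguised_executable("Holiday photo.jpg.exe"))  # True
print(is_disguised_executable("Holiday photo.jpg"))      # False
```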
Internal metadata
A second way to identify a file format is to store information regarding the format inside the file itself. Usually, such information is written as one or more binary strings, or as tagged or raw text, placed at fixed, specific locations within the file. Since the easiest place to locate them is at the beginning of the file, this area is usually called a file header when it is longer than a few bytes, or a magic number when it is just a few bytes long.
File header
The metadata contained in a file header is not necessarily stored only at the beginning of the file; it may be present in other areas too, often including the end of the file, depending on the file format or the type of data it contains. Character-based (text) files have character-based, human-readable headers, whereas binary formats usually feature binary headers, although that is not a rule: a human-readable file header may require more bytes, but is easily discernible with simple text or hexadecimal editors. File headers may contain not only the information required by algorithms to identify the file format, but also real metadata about the file and its contents. For example, most image file formats store information about image size, resolution, colour space or format, and optionally other authoring information such as who made the image, when and where it was made, and what camera model and shooting parameters were used (if any; cf. Exif), and so on.
Such metadata may be used by a program reading or interpreting the file both during the loading process and after that, but can also be used by the operating system to quickly capture information about the file itself without loading it all into memory.
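As one concrete case of header metadata, a GIF file stores the image dimensions directly after its six-byte signature, as two unsigned 16-bit little-endian integers. The sketch below reads only those first ten bytes; the function name `gif_info` and the hand-built sample are assumptions made for illustration, and the rest of the GIF structure is ignored.

```python
import struct


def gif_info(data: bytes):
    """Return (version, width, height) from a GIF header, or None if not a GIF."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    # Logical screen width and height follow the signature, little-endian.
    width, height = struct.unpack("<HH", data[6:10])
    return data[3:6].decode("ascii"), width, height


# A hand-built header fragment, not a complete image.
sample = b"GIF89a" + struct.pack("<HH", 320, 200)
print(gif_info(sample))  # ('89a', 320, 200)
```

A file manager could call such a function on the first few bytes of each file to display dimensions without loading the whole image into memory.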
The file header has at least two downsides as a method of file-format identification. First, at least a few initial blocks of the file need to be read in order to gain such information; these could be fragmented across different locations of the same storage medium, requiring more seek and I/O time, which is particularly bad when identifying large quantities of files at once (as when a GUI browses a folder with thousands of files and must determine a file icon or thumbnail for each of them). Second, if the header is binary and hard-coded (i.e. the header itself requires non-trivial interpretation in order to be recognized), especially for the sake of protecting metadata content, there is some risk that the file format is misinterpreted at first sight, or even badly written at the source, often resulting in corrupt metadata (which, in extremely pathological cases, might even render the file unreadable).
A more logically sophisticated example of file header is that used in wrapper (or container) file formats.
Magic number
See also: Magic number (programming)
One way to incorporate such metadata, often associated with Unix and its derivatives, is just to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML.
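A magic number test for the formats just mentioned might be sketched as follows. The table `MAGIC_TABLE` and the function `sniff` are hypothetical; the GIF check is an exact prefix match, while the HTML check is only a heuristic and will miss files that begin with comments or arbitrary text.

```python
# Exact signatures that must appear at the very start of the file.
MAGIC_TABLE = [
    (b"GIF87a", "GIF image (87a)"),
    (b"GIF89a", "GIF image (89a)"),
]

# Prefixes that merely suggest HTML or XHTML; compared case-insensitively.
HTML_HINTS = (b"<html", b"<!doctype", b"<?xml")


def sniff(data: bytes):
    """Return a format guess based on the leading bytes, or None if unrecognized."""
    for magic, name in MAGIC_TABLE:
        if data.startswith(magic):
            return name
    head = data.lstrip()[:64].lower()   # tolerate leading blank lines
    if head.startswith(HTML_HINTS):
        return "possibly HTML or XHTML"
    return None


print(sniff(b"GIF89a\x01\x00\x01\x00"))       # GIF image (89a)
print(sniff(b"\n\n<HTML><body>hi</body>"))    # possibly HTML or XHTML
print(sniff(b"just some random text"))        # None
```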
The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reasonably reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is relatively inefficient, especially for displaying large lists of files (in contrast, filename and metadata-based methods need check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where filetypes don't lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if a file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type. On the other hand a valid magic number does not guarantee that the file is not corrupt or of a wrong type.
So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
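A shebang line can be recognized by its two leading bytes, `#!`, and then split into the interpreter and its options. The sketch below is a simplified reading of that convention; how real kernels split the remainder into arguments varies between systems, and the function name `parse_shebang` is an assumption for illustration.

```python
def parse_shebang(first_line: bytes):
    """Return (interpreter, options) for a shebang line, or None if there is none."""
    if not first_line.startswith(b"#!"):
        return None
    rest = first_line[2:].strip().decode("utf-8", errors="replace")
    parts = rest.split(None, 1)   # interpreter, then everything else
    if not parts:
        return None
    options = parts[1] if len(parts) > 1 else ""
    return parts[0], options


print(parse_shebang(b"#!/bin/sh -e\n"))   # ('/bin/sh', '-e')
print(parse_shebang(b"echo hello\n"))     # None
```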
Another operating system using magic numbers is AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in Hunk executable file format and also to let single programs, tools and utilities deal automatically with their saved data files, or any other kind of file types when saving and loading data. This system was then enhanced with the Amiga standard Datatype recognition system. Another method was the FourCC method, originating in OSType on Macintosh, later adapted by Interchange File Format (IFF) and derivatives.
External metadata
A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself.
This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions — for instance, for compatibility with MS-DOS's three character limit — most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.
Note that zip files or archive files solve the problem of handling metadata. A utility program collects multiple files together along with metadata about each file and the folders/directories they came from all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but now is transmissible as a single file across operating systems by FTP systems or attached to email. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
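The archive approach can be sketched with Python's standard `zipfile` module, which records a name, size and timestamp for each member alongside its compressed data. The helper names `pack` and `list_entries` and the sample paths are hypothetical; encryption is not shown.

```python
import io
import zipfile


def pack(files: dict) -> bytes:
    """Bundle {path: content} into a single compressed zip archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path, content in files.items():
            archive.writestr(path, content)
    return buffer.getvalue()


def list_entries(blob: bytes):
    """Read back the per-file metadata (path and uncompressed size) from an archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


blob = pack({"docs/readme.txt": b"hello", "images/logo.gif": b"GIF89a"})
print(list_entries(blob))  # [('docs/readme.txt', 5), ('images/logo.gif', 6)]
```

The folder paths travel inside the archive itself, so they survive transfer between systems whose filesystems represent metadata differently.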
Mac OS type-codes
The Mac OS' Hierarchical File System stores codes for creator and type as part of the directory entry for each file. These codes are referred to as OSTypes, and for instance a HyperCard "stack" file has a creator of WILD (from Hypercard's previous name, "WildCard") and a type of STAK. The type code specifies the format of the file, while the creator code specifies the default program to open it with when double-clicked by the user. For example, the user could have several text files all with the type code of TEXT, but which each open in a different program, due to having differing creator codes. RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions — e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript file.
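Both schemes reduce to small fixed-width codes. The sketch below assumes the common description of an OSType as four characters packed into a 32-bit big-endian integer, and models the RISC OS lookup with a table containing only the single entry mentioned above; all function and table names are hypothetical.

```python
def ostype_to_int(code: str) -> int:
    """Pack a four-character code such as 'STAK' into a 32-bit integer."""
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four characters")
    return int.from_bytes(raw, "big")


def int_to_ostype(value: int) -> str:
    """Unpack a 32-bit integer back into its four-character code."""
    return value.to_bytes(4, "big").decode("ascii")


# Only the FF5 entry is taken from the text; a real table has many more.
RISC_OS_TYPES = {0xFF5: "PoScript"}


def riscos_type_name(filetype: int) -> str:
    """Look up the alias for a 12-bit RISC OS file type."""
    if not 0 <= filetype <= 0xFFF:
        raise ValueError("RISC OS file types are 12-bit numbers")
    return RISC_OS_TYPES.get(filetype, "unknown")


print(hex(ostype_to_int("STAK")))   # 0x5354414b
print(int_to_ostype(0x57494C44))    # WILD
print(riscos_type_name(0xFF5))      # PoScript
```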
Mac OS X Uniform Type Identifiers (UTIs)
Main article: Uniform Type Identifier
A Uniform Type Identifier (UTI) is a method used in Mac OS X for uniquely identifying "typed" classes of entity, such as file formats. It was developed by Apple as a replacement for OSType (type & creator codes).
The UTI is a Core Foundation string that uses reverse-DNS naming. Common or standard types use the "public" domain (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
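A conformance hierarchy can be modelled as a mapping from each identifier to its direct supertypes. The sketch below is not Apple's API; it is a hypothetical model in which only the public.png, public.image and public.data chain is taken from the text, and each type may list several parents to reflect membership in multiple hierarchies.

```python
# Direct supertypes for a small slice of the hierarchy (illustrative only).
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
}


def conforms_to(uti: str, supertype: str) -> bool:
    """True if uti equals supertype or reaches it by following parent links."""
    if uti == supertype:
        return True
    return any(conforms_to(parent, supertype)
               for parent in CONFORMS_TO.get(uti, []))


print(conforms_to("public.png", "public.data"))   # True
print(conforms_to("public.data", "public.png"))   # False
```

A program that can handle any public.image can therefore accept a public.png file without knowing about PNG specifically.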
In addition to file formats, UTIs can also be used for other entities which can exist in the OS X file system, including:
- Pasteboard data
- Folders (directories)
- Translatable types (as handled by the Translation Manager)
- Bundles
- Frameworks
- Streaming data
- Aliases and symlinks
OS/2 Extended Attributes
The HPFS, FAT12 and FAT16 (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets with a name, a coded type for the value and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types.
The NTFS filesystem also allows OS/2 extended attributes to be stored, as one of the file forks, but this feature is present merely to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be parsed entirely by the applications.
POSIX extended attributes
On Unix and Unix-like systems, the ext2, ext3, ReiserFS version 3, XFS, JFS, FFS, and HFS+ filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique, which can be accessed by their "name" parts.
PRONOM Unique Identifiers (PUIDs)
The PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique and unambiguous identifiers for file formats, which has been developed by The National Archives of the UK as part of its PRONOM technical registry service. PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of UK government and some digital preservation programmes, the PUID scheme does provide greater granularity than most alternative schemes.
MIME types
MIME types are widely used in many Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. They form a standardised system of identifiers (managed by IANA), each consisting of a type and a sub-type separated by a slash, for instance text/html or image/gif. They were originally intended as a way of identifying what type of file was attached to an e-mail, independent of the source and target operating systems. MIME types identify files on BeOS, AmigaOS 4.0 and MorphOS, and also store unique application signatures for application launching. In AmigaOS and MorphOS the MIME type system works in parallel with the Amiga-specific Datatype system.
MIME types have problems, though: several organisations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
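To make the type/sub-type structure concrete, here is a small sketch using Python's standard mimetypes module. Note that this module guesses a MIME type from the file name alone and does not inspect the contents; the helper split_mime is a hypothetical name introduced for this example.

```python
import mimetypes
from typing import Tuple

def split_mime(mime: str) -> Tuple[str, str]:
    """Split a MIME type such as 'text/html' into its type and sub-type."""
    major, _, minor = mime.partition("/")
    if not major or not minor:
        raise ValueError(f"not a type/sub-type pair: {mime!r}")
    return major, minor

def guess_from_name(filename: str) -> str:
    """Guess a MIME type from the file name, falling back to a generic binary type."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"
```

The fallback application/octet-stream is the registered generic type for arbitrary binary data, which is a common choice when no better guess is available.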
File format identifiers (FFIDs)
File format identifiers are another, not widely used, way to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An identifier has the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the originating or maintaining organisation (the number is a key into a database of companies and standards organisations), and the two following digits categorise the type of file in hexadecimal. The final part is the usual file extension of the file, or the international standard number of the file, padded on the left with zeros. For example, the PNG file specification has the FFID 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number and 000000001 indicates ISO.
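The three-part layout lends itself to a simple parser. The sketch below is an assumption-laden illustration rather than the Description Explorer implementation: the field widths (9, 2 and 7 characters) are inferred from the pattern NNNNNNNNN-XX-YYYYYYY, and the names FFID and parse_ffid are hypothetical.

```python
import re
from typing import NamedTuple

class FFID(NamedTuple):
    organisation: int   # key into the organisation database
    category: int       # file category, written in hexadecimal in the identifier
    identifier: str     # extension or standard number, with the zero padding removed

# Field widths are inferred from the NNNNNNNNN-XX-YYYYYYY pattern (an assumption).
_FFID_RE = re.compile(r"(\d{9})-([0-9A-Fa-f]{2})-(\w{7})")

def parse_ffid(text: str) -> FFID:
    """Split an FFID string into its three fields."""
    match = _FFID_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a well-formed FFID: {text!r}")
    organisation, category, identifier = match.groups()
    return FFID(int(organisation), int(category, 16), identifier.lstrip("0"))
```

Applied to the PNG example, the parser yields organisation 1, category 0x31 and identifier 15948.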
File content based format identification
Another, though the least popular, way to identify a file format is to examine the file contents for patterns that distinguish one file type from another. The contents of a file are a sequence of bytes, and a byte can take 256 distinct values (0 to 255). Counting how often each byte value occurs, often referred to as the byte frequency distribution, therefore gives distinguishable patterns for identifying file types. Many content-based file type identification schemes use the byte frequency distribution to build representative models of file types and then apply statistical and data mining techniques to identify them.[3]
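A byte frequency distribution is straightforward to compute. The following sketch builds one and then picks the closest model by summed absolute difference; that nearest-model rule is a deliberately naive stand-in chosen for illustration, not the method of any particular published scheme, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

def byte_frequency_distribution(data: bytes) -> List[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    return [counts[value] / len(data) for value in range(256)]

def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def closest_type(sample: bytes, models: Dict[str, List[float]]) -> str:
    """Pick the model whose distribution is nearest to the sample's."""
    bfd = byte_frequency_distribution(sample)
    return min(models, key=lambda name: l1_distance(bfd, models[name]))
```

In practice the models would be averaged over many example files of each type, and a more robust statistical or machine-learning classifier would replace the simple distance comparison.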
File structure
There are several ways to structure data in a file. The most usual ones are described below.
Unstructured formats (raw memory dumps)
Earlier file formats used raw data formats that consisted of directly dumping the memory images of one or more structures into the file.
This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.
Chunk based formats
Electronic Arts and Commodore-Amiga pioneered this kind of file structure in 1985 with their IFF (Interchange File Format). In it, each piece of data is embedded in a container that contains a signature identifying the data, as well as the length of the data (for binary encoded files). This type of container is called a "chunk". The signature is usually called a chunk id, chunk identifier, or tag identifier.
With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand.
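The skipping behaviour can be sketched in a few lines. The layout used here (a four-byte tag, a four-byte big-endian length, then the payload) is a simplified assumption for illustration and is not the exact IFF, RIFF or PNG layout; the tag names NAME, BODY and XTRA are made up.

```python
import io
import struct
from typing import BinaryIO, Dict, Iterator, Tuple

def write_chunk(stream: BinaryIO, tag: bytes, payload: bytes) -> None:
    """Write one chunk: 4-byte tag, 4-byte big-endian length, then the payload."""
    if len(tag) != 4:
        raise ValueError("a chunk tag is exactly four bytes long")
    stream.write(tag + struct.pack(">I", len(payload)) + payload)

def read_chunks(stream: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (tag, payload) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if not header:
            return
        if len(header) < 8:
            raise ValueError("truncated chunk header")
        tag, length = struct.unpack(">4sI", header)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated chunk payload")
        yield tag, payload

KNOWN_TAGS = {b"NAME", b"BODY"}  # made-up tags this reader understands

def load(stream: BinaryIO) -> Dict[bytes, bytes]:
    """Collect the chunks this reader understands and skip all the others."""
    return {tag: payload for tag, payload in read_chunks(stream) if tag in KNOWN_TAGS}
```

Because every chunk carries its own length, a reader can step over an unknown chunk such as XTRA without understanding its contents, which is what lets newer writers add chunks without breaking older readers.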
This concept has been reused repeatedly, by RIFF (the Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and the Structured Data Exchange Format (SDXF). Even XML can be considered a kind of chunk based format, since each data element is surrounded by tags which are akin to chunk identifiers.
Directory based formats
This is another extensible kind of format, one that closely resembles a file system (OLE Documents are actual filesystems): the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are disk images, OLE documents and TIFF images.
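A toy version of the idea is sketched below. The layout is invented for this example and does not match TIFF, OLE or any disk image format: a four-byte entry count, then one fixed-size directory entry per item (an eight-byte name, a four-byte offset and a four-byte length, all big-endian), then the data itself.

```python
import struct
from typing import Dict

HEADER = struct.Struct(">I")     # number of directory entries
ENTRY = struct.Struct(">8sII")   # name (NUL-padded), offset, length

def pack_directory_file(items: Dict[str, bytes]) -> bytes:
    """Build a file image: header, directory entries, then the data blocks."""
    offset = HEADER.size + ENTRY.size * len(items)  # data starts after the directory
    parts = [HEADER.pack(len(items))]
    blobs = []
    for name, data in items.items():
        raw_name = name.encode("ascii")
        if len(raw_name) > 8:
            raise ValueError("names are limited to eight bytes in this toy layout")
        parts.append(ENTRY.pack(raw_name, offset, len(data)))
        blobs.append(data)
        offset += len(data)
    return b"".join(parts + blobs)

def read_entry(image: bytes, wanted: str) -> bytes:
    """Find an entry by name in the directory and return the data it points to."""
    (count,) = HEADER.unpack_from(image, 0)
    for index in range(count):
        raw_name, offset, length = ENTRY.unpack_from(image, HEADER.size + index * ENTRY.size)
        if raw_name.rstrip(b"\0").decode("ascii") == wanted:
            return image[offset:offset + length]
    raise KeyError(wanted)
```

A reader only needs to scan the directory to locate an item, so it can jump straight to the data it wants instead of walking the whole file as it would with a chunk stream.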
See also
- Audio file format
- Chemical file format
- Container format (digital)
- Document file format
- DROID file format identification utility
- File (command), a file type identification utility
- File Formats, Transformation, and Migration (related wikiversity article)
- FormatFactory, a free omni file format converter.
- Future proofing
- Graphics file format summary
- List of archive formats
- Image file formats
- List of file formats
- List of free file formats
- List of motion and gesture file formats
- Magic number (programming)
- List of file signatures, or "magic numbers"
- Object file
- Open format
- TrID, a freeware file type identification utility
- Windows file types
References
- [1] Foundation for a Free Information Infrastructure. "Europarl 2003-09-24: Amended Software Patent Directive". http://swpat.ffii.org/papers/europarl0309/index.en.html. Retrieved 2007-01-07.
- [2] PC World. "Windows Tips: For Security Reasons, It Pays To Know Your File Extensions". http://www.pcworld.com/article/id,113758-page,1/article.html. Retrieved 2008-06-20.
- [3] "File Format Identification". http://www.forensicswiki.org/wiki/File_Format_Identification.
- "Extended Attribute Data Types". REXX Tips & Tricks, Version 2.80. http://markcrocker.com/rexxtipsntricks/rxtt28.2.0301.html. Retrieved February 9 2005.
- "Extended Attributes used by the WPS". REXX Tips & Tricks, Version 2.80. http://markcrocker.com/rexxtipsntricks/rxtt28.2.0300.html. Retrieved February 9 2005.
- "Extended Attributes - what are they and how can you use them ?". Roger Orr. http://www.howzatt.demon.co.uk/articles/06may93.html. Retrieved February 9 2005.
External links
- FileInfo.com - File types resource