About Apache Tika
Apache Tika is a project hosted by the Apache Software Foundation that detects a wide range of file and content types. The project publishes a full list of supported formats, which includes many document formats, e.g. text/plain, text/xml, the proprietary Microsoft OOXML or the office standard OpenDocument. Furthermore, images (image/gif, image/jpeg, image/bmp or image/tiff), videos (video/avi, video/mpeg or video/mp4) and audio (audio/ogg, audio/x-wav or audio/mpeg) can be recognized by Tika. Even feeds (application/rss+xml, application/atom+xml) are recognized, along with many more.
Tika can be used in several ways: through a graphical frontend, on the command line, as a server, or as a Java library. The results can be returned as plain text, JSON or even HTML.
Every file or stream contains either header information or other unique characteristics that help to identify its content type, and Tika relies on these for detection. For each content type that Tika is able to recognize there is a corresponding parser.
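The idea of identifying content by its header can be illustrated with a minimal sketch in plain Java. It is not Tika's implementation and does not use the Tika API; the class MagicSniffer and its method sniff are hypothetical names chosen for this example. The sketch compares the first bytes of the content against a handful of well-known signatures and falls back to the generic binary type when nothing matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; NOT part of the Tika API.
final class MagicSniffer {

    // Well-known leading byte signatures ("magic numbers") of a few formats.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] GIF = "GIF8".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    // Returns a content type for the given leading bytes of a document.
    static String sniff(byte[] prefix) {
        if (startsWith(prefix, PNG)) return "image/png";
        if (startsWith(prefix, GIF)) return "image/gif";
        if (startsWith(prefix, PDF)) return "application/pdf";
        if (startsWith(prefix, JPEG)) return "image/jpeg";
        return "application/octet-stream"; // nothing matched: generic binary data
    }

    // True if data begins with exactly the bytes in magic.
    private static boolean startsWith(byte[] data, byte[] magic) {
        return data != null
                && data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    public static void main(String[] args) {
        byte[] gifHeader = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(gifHeader));
    }
}
```

A real detector such as Tika knows far more signatures and combines several hints, so this sketch only shows the principle.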
Running the Apache Tika Frontend
Apache Tika provides a graphical user interface for inspecting content. The frontend prints all the information that Tika is able to extract, e.g. the Content-Length, Content-Type or Content-Encoding. The frontend is really basic and simply prints the information line by line. For a non-technical user this might be quite hard to read, but then again a non-technical user normally wouldn’t use such a tool.
In order to run the Tika frontend, simply use the command line option -g or --gui.
Using Apache Tika from the command line
In addition to the graphical user interface, Tika is usable from the command line. When run with the command line option -d or --detect, Tika detects the content type of a file and prints the result to the console.
$ java -jar tika-app-{version}.jar -d {file}
text/plain # document type of the file that was probed
$
Using Apache Tika with Java
When implementing Java-based software, Tika may be used directly as a library, and using it is quite simple. There are several ways to detect content programmatically, as the following code snippet shows. Anything that is provided as a stream can be detected, but detection via file name is possible too; in that case only the file extension is used to “detect” the content. A file or the path to a file may also be used for detection, as can the first bytes of the content or a URL.
Tika tika = new Tika();
tika.detect(InputStream stream); // the document stream
tika.detect(String name); // the file name of the document
tika.detect(File file); // the file
tika.detect(Path path); // the path of the file
tika.detect(byte[] prefix); // first few bytes of the document
tika.detect(URL url); // the URL of the resource
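Why name-based detection is only a guess can be seen in a small plain-Java sketch. It does not call Tika and is not how Tika is implemented; the class ExtensionGuesser, its method guess and the tiny lookup table are assumptions made for this example. The point is that only the text after the last dot is consulted, so a renamed file yields a wrong answer while the file content is never read.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; NOT part of the Tika API.
final class ExtensionGuesser {

    // A deliberately tiny extension table; a real library ships a much larger one.
    private static final Map<String, String> TYPES = Map.of(
            "txt", "text/plain",
            "xml", "text/xml",
            "gif", "image/gif",
            "jpg", "image/jpeg",
            "mp4", "video/mp4");

    // Guesses a content type from the file name alone; the content is never inspected.
    static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream"; // no usable extension
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(extension, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(guess("report.TXT"));  // extension lookup is case-insensitive
        System.out.println(guess("holiday.gif"));
        System.out.println(guess("README"));      // no extension, so only the fallback is left
    }
}
```

A GIF image saved as holiday.txt would be reported as text/plain by this sketch, which is why detection from a stream or from the leading bytes is the more reliable choice when the content is available.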
Dependencies for Development
For developing software that uses Tika, the library can be added as a dependency. How to use Tika with Maven or Gradle is described below. Of course, other dependency management systems such as Ivy or Grape may be used as well.
Maven
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.13</version>
</dependency>
Gradle
dependencies {
    compile 'org.apache.tika:tika-core:1.13'
}
