[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...


JPercivall
GitHub user adamonduty opened a pull request:

    https://github.com/apache/incubator-nifi/pull/27

    NIFI-296: Extend capability of IdentifyMimeType

    ```
    This commit backs IdentifyMimeType with the Apache Tika library. Tika
    provides detailed mime type identification such as the ability to
    differentiate normal zip files from OOXML MS Office documents.
   
    The mime.type attribute continues to be set, though some mime types
    have changed due to Tika naming them differently. In addition,
    the mime.extension attribute is set to provide the commonly used
    extension for the mime type (if known).
    ```
   
    Some additional notes about this commit:
   
    I removed the IDENTIFY_ZIP and IDENTIFY_TAR properties. Keeping IDENTIFY_ZIP doesn't make sense because Tika is designed to identify container formats like zip files. Excluding zip files from detection would exclude a number of common mime types, which seems like undesirable behavior. IDENTIFY_TAR is in a similar situation.
   
    Also, in both cases, the previous code would "identify" a zip or tar file by attempting to open it with Zip and Tar readers. I believe Tika uses magic byte detection as a filtering mechanism to avoid applying deep inspection logic (i.e., opening the zip with a reader) when not necessary.
   
    It takes about 2 seconds to bring up the Tika detectors, which makes the tests run longer, but I believe the detection itself is roughly in the same performance category. The code shares a Tika config and list of detectors to minimize the performance impact related to bringing up detectors.
   
    I also replaced the test resource `1.tar` with a version created by a modern version of tar. The previous tar didn't use the [ustar format](http://en.wikipedia.org/wiki/Tar_%28computing%29#UStar_format), which was standardized in 1988. Tika also couldn't identify the previous tar using magic byte detection.
   
    And finally, a few of the detected mime types changed names due to Tika naming them differently.
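
To make the magic-byte idea concrete, here is a minimal sketch in plain Java. It is not Tika's implementation and not the code in this pull request. The class name `MagicSniffer`, the method names, and the returned type strings are all made up for illustration. It relies on two well-known signatures: ZIP local file headers start with `PK\003\004`, and ustar archives carry the ASCII string `ustar` at offset 257 of the first 512-byte header block. A tar written in the older pre-ustar layout has no such marker, which would be consistent with the old `1.tar` not being recognized by magic bytes.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a tiny signature check, not Tika's detector.
class MagicSniffer {
    // ZIP local file header signature "PK\003\004" at offset 0.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // ustar archives carry "ustar" at offset 257 of the first header block.
    private static final byte[] USTAR_MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);
    private static final int USTAR_OFFSET = 257;

    // True if 'magic' occurs in 'data' starting exactly at 'offset'.
    static boolean matchesAt(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns an illustrative type string, or null when no signature matches.
    static String sniff(byte[] prefix) {
        if (matchesAt(prefix, 0, ZIP_MAGIC)) {
            return "application/zip";
        }
        if (matchesAt(prefix, USTAR_OFFSET, USTAR_MAGIC)) {
            return "application/x-tar";
        }
        return null;
    }
}
```

Only the first few hundred bytes are inspected, so no archive reader is opened. Container formats built on ZIP (such as OOXML) share the same leading signature, so telling them apart needs a deeper look than this sketch performs.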

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adamonduty/incubator-nifi NIFI-296-extend-IdentifyMimeType

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-nifi/pull/27.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #27
   
----
commit 16fb2b826c0cd983b5d905ceed7aff2a84383d33
Author: Adam Lamar <[hidden email]>
Date:   2015-02-14T20:57:41Z

    NIFI-296: Extend capability of IdentifyMimeType
   
    This commit backs IdentifyMimeType with the Apache Tika library. Tika
    provides detailed mime type identification such as the ability to
    differentiate normal zip files from OOXML MS Office documents.
   
    The mime.type attribute continues to be set, though some mime types
    have changed due to Tika naming them differently. In addition,
    the mime.extension attribute is set to provide the commonly used
    extension for the mime type (if known).

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

JPercivall
Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24942181
 
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -239,87 +128,39 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
             }
     
             final ProcessorLog logger = getLogger();
    -        final boolean identifyZip = context.getProperty(IDENTIFY_ZIP).asBoolean();
    -        final boolean identifyTar = context.getProperty(IDENTIFY_TAR).asBoolean();
     
             final ObjectHolder<String> mimeTypeRef = new ObjectHolder<>(null);
    +        final ObjectHolder<String> extensionRef = new ObjectHolder<>(null);
             session.read(flowFile, new InputStreamCallback() {
                 @Override
                 public void process(final InputStream stream) throws IOException {
                     try (final InputStream in = new BufferedInputStream(stream)) {
    -                    // read in up to magicHeaderMaxLength bytes
    -                    in.mark(magicHeaderMaxLength);
    -                    byte[] header = new byte[magicHeaderMaxLength];
    -                    for (int i = 0; i < header.length; i++) {
    -                        final int next = in.read();
    -                        if (next >= 0) {
    -                            header[i] = (byte) next;
    -                        } else if (i == 0) {
    -                            header = new byte[0];
    -                        } else {
    -                            final byte[] newBuffer = new byte[i - 1];
    -                            System.arraycopy(header, 0, newBuffer, 0, i - 1);
    -                            header = newBuffer;
    -                            break;
    -                        }
    -                    }
    -                    in.reset();
    -
    -                    for (final MagicHeader magicHeader : magicHeaders) {
    -                        if (magicHeader.matches(header)) {
    -                            mimeTypeRef.set(magicHeader.getMimeType());
    -                            return;
    -                        }
    -                    }
    -
    -                    if (!identifyZip) {
    -                        for (final MagicHeader magicHeader : zipMagicHeaders) {
    -                            if (magicHeader.matches(header)) {
    -                                mimeTypeRef.set(magicHeader.getMimeType());
    -                                return;
    -                            }
    -                        }
    -                    }
    -
    -                    if (!identifyTar) {
    -                        for (final MagicHeader magicHeader : tarMagicHeaders) {
    -                            if (magicHeader.matches(header)) {
    -                                mimeTypeRef.set(magicHeader.getMimeType());
    -                                return;
    -                            }
    -                        }
    +                    TikaInputStream tikaStream = TikaInputStream.get(in);
    +                    Metadata metadata = new Metadata();
    +                    // Get mime type
    +                    MediaType mediatype = detector.detect(tikaStream, metadata);
    +                    mimeTypeRef.set(mediatype.toString());
    +                    // Get common file extension
    +                    try {
    +                        MimeType mimetype;
    +                        mimetype = config.getMimeRepository().forName(mediatype.toString());
    +                        extensionRef.set(mimetype.getExtension());
    +                    } catch (MimeTypeException ex) {
    +                        logger.warn("MIME type detection failed: {}", new Object[]{ex.toString()});
    --- End diff --
   
    I would use "new Object[] {ex}" rather than "new Object[] {ex.toString()}"... this way, if the user configures logback to use debug-level logging, the logger will automatically get the stack trace.
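
The difference between the two forms can be sketched with `java.util.logging`, which is used here only as a stand-in because it ships with the JDK. NiFi's `ProcessorLog` and logback are not involved in this example, and the class name `ThrowableLoggingDemo` and its in-memory handler are hypothetical. The point it illustrates is general: once an exception is flattened with `toString()`, the logger only ever sees a string, whereas handing over the `Throwable` keeps the stack trace available to whatever formatter is configured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Illustrative only: java.util.logging stands in for the processor's logger.
class ThrowableLoggingDemo {
    // Collects records in memory so we can inspect what the logger received.
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            records.add(record);
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }

    static List<LogRecord> logBothWays(Exception ex) {
        Logger logger = Logger.getAnonymousLogger();
        logger.setUseParentHandlers(false);
        CapturingHandler handler = new CapturingHandler();
        logger.addHandler(handler);

        // Variant 1: only the string form is passed; the stack trace is lost.
        logger.log(Level.WARNING, "MIME type detection failed: {0}", ex.toString());
        // Variant 2: the Throwable itself is passed and stays attached to the record.
        logger.log(Level.WARNING, "MIME type detection failed", ex);
        return handler.records;
    }
}
```

Calling `logBothWays(new IllegalStateException("boom"))` yields two records: `getThrown()` is `null` on the first and the original exception on the second.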


---

JPercivall
In reply to this post by JPercivall
Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24942791
 
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -327,148 +168,41 @@ public void process(final InputStream in) throws IOException {
             session.transfer(flowFile, REL_SUCCESS);
         }
     
    -    private static interface ContentScanningMimeTypeIdentifier {
    -
    -        boolean isEnabled(ProcessContext context);
    -
    -        String getMimeType(InputStream in) throws IOException;
    -    }
    -
    -    private static class ZipIdentifier implements ContentScanningMimeTypeIdentifier {
    -
    -        @Override
    -        public String getMimeType(final InputStream in) throws IOException {
    -            final ZipInputStream zipIn = new ZipInputStream(in);
    -            try {
    -                if (zipIn.getNextEntry() != null) {
    -                    return "application/zip";
    -                }
    -            } catch (final Exception e) {
    -            }
    -            return null;
    -        }
    -
    -        @Override
    -        public boolean isEnabled(final ProcessContext context) {
    -            return context.getProperty(IDENTIFY_ZIP).asBoolean();
    -        }
    -    }
    -
    -    private static class TarIdentifier implements ContentScanningMimeTypeIdentifier {
    -
    -        @Override
    -        public String getMimeType(final InputStream in) throws IOException {
    -            try (final TarArchiveInputStream tarIn = new TarArchiveInputStream(in)) {
    -                final TarArchiveEntry firstEntry = tarIn.getNextTarEntry();
    -                if (firstEntry != null) {
    -                    if (firstEntry.getName().equals(FlowFilePackagerV1.FILENAME_ATTRIBUTES)) {
    -                        final TarArchiveEntry secondEntry = tarIn.getNextTarEntry();
    -                        if (secondEntry != null && secondEntry.getName().equals(FlowFilePackagerV1.FILENAME_CONTENT)) {
    -                            return "application/flowfile-v1";
    -                        }
    -                    }
    -                    return "application/tar";
    -                }
    -            } catch (final Exception e) {
    -            }
    -            return null;
    -        }
    -
    -        @Override
    -        public boolean isEnabled(final ProcessContext context) {
    -            return context.getProperty(IDENTIFY_TAR).asBoolean();
    -        }
    +    private Detector getFlowFileV3Detector() {
    +        return new MagicDetector(FLOWFILE_V3, FlowFilePackagerV3.MAGIC_HEADER);
         }
     
    -    private static interface MagicHeader {
    -
    -        int getRequiredBufferLength();
    -
    -        String getMimeType();
    -
    -        boolean matches(final byte[] header);
    +    private Detector getFlowFileV1Detector() {
    +        return new FlowFileV1Detector();
         }
     
    -    private static class SimpleMagicHeader implements MagicHeader {
    -
    -        private final String mimeType;
    -        private final int offset;
    -        private final byte[] byteSequence;
    -
    -        public SimpleMagicHeader(final String mimeType, final byte[] byteSequence) {
    -            this(mimeType, byteSequence, 0);
    -        }
    -
    -        public SimpleMagicHeader(final String mimeType, final byte[] byteSequence, final int offset) {
    -            this.mimeType = mimeType;
    -            this.byteSequence = byteSequence;
    -            this.offset = offset;
    -        }
    -
    -        @Override
    -        public int getRequiredBufferLength() {
    -            return byteSequence.length + offset;
    -        }
    -
    -        @Override
    -        public String getMimeType() {
    -            return mimeType;
    -        }
    +    private class FlowFileV1Detector implements Detector {
     
             @Override
    -        public boolean matches(final byte[] header) {
    -            if (header.length < getRequiredBufferLength()) {
    -                return false;
    +        public MediaType detect(InputStream in, Metadata mtdt) throws IOException {
    +            // Sanity check the stream. This may not be a tarfile at all
    +            in.mark(FlowFilePackagerV1.FILENAME_ATTRIBUTES.length());
    --- End diff --
   
    I'm not sure that I follow the logic here. It appears that you are expecting the first several bytes to be the filename of the attributes file... this is not going to happen for a .tar file, as the first several bytes will be tar header info.
    We should probably wait until after Tika has identified a file as a .tar file and then if it is a .tar file perform the logic below to check the filename of the first file. If it's a match then call it flowfile-stream-v1, else call it tar.
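
A rough sketch of that two-stage idea, working on the first 512-byte tar header block only: confirm the block looks like a ustar header, then read the name field (the first 100 bytes, NUL-padded) and compare it with `flowfile.attributes`. Everything here is an assumption for illustration: the class `TarThenFlowFileCheck`, the returned strings, and the use of the `ustar` marker at offset 257 in place of Tika's own tar detection. Unlike the removed `TarIdentifier`, it does not look at the second entry.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: classify a tar header block without opening a tar reader.
class TarThenFlowFileCheck {
    // Name of the attributes entry, as mentioned in this thread.
    static final String ATTRIBUTES_NAME = "flowfile.attributes";
    private static final byte[] USTAR = "ustar".getBytes(StandardCharsets.US_ASCII);
    private static final int USTAR_OFFSET = 257;
    private static final int NAME_FIELD_LENGTH = 100;

    // Stage 1: does the block carry the ustar marker?
    static boolean isUstar(byte[] block) {
        if (block.length < USTAR_OFFSET + USTAR.length) {
            return false;
        }
        for (int i = 0; i < USTAR.length; i++) {
            if (block[USTAR_OFFSET + i] != USTAR[i]) {
                return false;
            }
        }
        return true;
    }

    // Stage 2: the entry name is the NUL-terminated prefix of the first 100 bytes.
    static String firstEntryName(byte[] block) {
        int limit = Math.min(NAME_FIELD_LENGTH, block.length);
        int end = 0;
        while (end < limit && block[end] != 0) {
            end++;
        }
        return new String(block, 0, end, StandardCharsets.US_ASCII);
    }

    // Returns an illustrative type string, or null if the block is not a ustar header.
    static String classify(byte[] block) {
        if (!isUstar(block)) {
            return null;
        }
        return ATTRIBUTES_NAME.equals(firstEntryName(block))
                ? "application/flowfile-v1"
                : "application/tar";
    }
}
```

The trade-off is that a pre-ustar tar would fall through stage 1, so this ordering only helps for archives that carry the marker.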


---

JPercivall
In reply to this post by JPercivall
Github user adamonduty commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24943462
 
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -327,148 +168,41 @@ public void process(final InputStream in) throws IOException {
    +    private class FlowFileV1Detector implements Detector {
     
             @Override
    -        public boolean matches(final byte[] header) {
    -            if (header.length < getRequiredBufferLength()) {
    -                return false;
    +        public MediaType detect(InputStream in, Metadata mtdt) throws IOException {
    +            // Sanity check the stream. This may not be a tarfile at all
    +            in.mark(FlowFilePackagerV1.FILENAME_ATTRIBUTES.length());
    --- End diff --
   
    The beauty of this approach is that the first bytes of the [tar header](http://en.wikipedia.org/wiki/Tar_%28computing%29#File_header) are the name of the file. This assumes the first file will be `flowfile.attributes`, but the later code assumes the same. Also, it's important to avoid invoking the tar reader in this situation, because this detector runs on every file (the custom detectors run before Tika's detectors).
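
As a rough illustration of that cheap check, the sketch below peeks at the leading bytes of a stream with `mark`/`reset` and compares them with an expected name, leaving the stream positioned at the start for whatever detector runs next. It is not the `FlowFileV1Detector` from the diff. The class `LeadingNamePeek` and its method names are hypothetical, and the stream is assumed to support `mark` (for example, a `BufferedInputStream`).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: compare the first bytes of a stream without consuming them.
class LeadingNamePeek {
    // Fills 'buffer' completely; returns false if the stream ends first.
    private static boolean readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                return false;
            }
            total += n;
        }
        return true;
    }

    // True if the stream begins with the ASCII bytes of 'expected'.
    static boolean startsWith(InputStream in, String expected) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] want = expected.getBytes(StandardCharsets.US_ASCII);
        byte[] got = new byte[want.length];
        try {
            in.mark(want.length);
            boolean matched = readFully(in, got) && Arrays.equals(got, want);
            in.reset(); // rewind so later detectors see the stream from byte 0
            return matched;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because at most `expected.length()` bytes are read, the cost per file stays small even though the check runs on every input.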


---

JPercivall
In reply to this post by JPercivall
Github user adamonduty commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24943517
 
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -327,148 +168,41 @@ public void process(final InputStream in) throws IOException {
    +    private class FlowFileV1Detector implements Detector {
     
             @Override
    -        public boolean matches(final byte[] header) {
    -            if (header.length < getRequiredBufferLength()) {
    -                return false;
    +        public MediaType detect(InputStream in, Metadata mtdt) throws IOException {
    +            // Sanity check the stream. This may not be a tarfile at all
    +            in.mark(FlowFilePackagerV1.FILENAME_ATTRIBUTES.length());
    --- End diff --
   
    For example,
   
    ```
    $ xxd flowfilev1.tar
    0000000: 666c 6f77 6669 6c65 2e61 7474 7269 6275  flowfile.attribu
    0000010: 7465 7300 0000 0000 0000 0000 0000 0000  tes.............
    ```


---

JPercivall
In reply to this post by JPercivall
Github user adamonduty commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24943917
 
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -239,87 +128,39 @@ public void onTrigger(final ProcessContext context, final ProcessSession session
    +                    TikaInputStream tikaStream = TikaInputStream.get(in);
    +                    Metadata metadata = new Metadata();
    +                    // Get mime type
    +                    MediaType mediatype = detector.detect(tikaStream, metadata);
    +                    mimeTypeRef.set(mediatype.toString());
    +                    // Get common file extension
    +                    try {
    +                        MimeType mimetype;
    +                        mimetype = config.getMimeRepository().forName(mediatype.toString());
    +                        extensionRef.set(mimetype.getExtension());
    +                    } catch (MimeTypeException ex) {
    +                        logger.warn("MIME type detection failed: {}", new Object[]{ex.toString()});
    --- End diff --
   
    Didn't know that! I'll fix and re-push.


---

JPercivall
In reply to this post by JPercivall
Github user adamonduty commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24945345
 
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -327,148 +168,41 @@ public void process(final InputStream in) throws IOException {
    +    private class FlowFileV1Detector implements Detector {
     
             @Override
    -        public boolean matches(final byte[] header) {
    -            if (header.length < getRequiredBufferLength()) {
    -                return false;
    +        public MediaType detect(InputStream in, Metadata mtdt) throws IOException {
    +            // Sanity check the stream. This may not be a tarfile at all
    +            in.mark(FlowFilePackagerV1.FILENAME_ATTRIBUTES.length());
    --- End diff --
   
    And at the risk of being too chatty, I'll admit that I like your idea of hooking into Tika when it identifies the tar, but I did not see a straightforward way to do so with the Tika API. At least in the class hierarchy for 1.7 (http://tika.apache.org/1.7/api/), there is no mention of tar at all, which leads me to believe that tar detection is done entirely by magic bytes, and I'm unsure whether Tika offers any flexible way of hooking into that process.
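
For readers unfamiliar with magic-byte detection of tar, here is a minimal sketch in plain Java. It assumes the ustar header layout, in which the ASCII bytes `ustar` sit at offset 257 of the first 512-byte block. The class and method names are hypothetical, and this is not Tika's implementation; it only illustrates why an archive written in the older, pre-ustar format offers no such bytes to match.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Tika or NiFi. */
final class UstarMagicSketch {

    // In the ustar header layout, the magic field starts at byte offset 257.
    private static final int MAGIC_OFFSET = 257;
    private static final byte[] MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given header block carries the ustar magic bytes. */
    static boolean hasUstarMagic(final byte[] header) {
        if (header == null || header.length < MAGIC_OFFSET + MAGIC.length) {
            return false; // too short to contain the magic field
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[MAGIC_OFFSET + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A header block of all zeros, or one shorter than 262 bytes, is reported as not matching.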


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24956147
 
    Don't worry about being too chatty -- better to be chatty and get this right than to be quiet and provide an inferior product :)



[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24956244
 
    Wow, good call on the first 100 bytes being the filename -- I looked up the tar format to see if that was indeed the case but found a big, verbose, confusing explanation of the header that I didn't understand -- I should have tried Wikipedia first. I am sorry that I doubted you :)
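
The point about the header can be illustrated with a small sketch in plain Java. It assumes the standard tar layout, in which the first 100 bytes of a header block hold the NUL-padded entry name. The class name and the `EXPECTED_NAME` constant are hypothetical stand-ins (the pull request uses `FlowFilePackagerV1.FILENAME_ATTRIBUTES`, whose value is assumed here), and this is not the code under review.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch; not the code from the pull request. */
final class TarNamePeekSketch {

    // Assumed value of FlowFilePackagerV1.FILENAME_ATTRIBUTES.
    static final String EXPECTED_NAME = "flowfile.attributes";

    /**
     * Peeks at the start of the stream and reports whether the first tar
     * entry name begins with EXPECTED_NAME. The stream position is restored.
     */
    static boolean firstEntryLooksLikeAttributes(final InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        final byte[] expected = EXPECTED_NAME.getBytes(StandardCharsets.US_ASCII);
        final byte[] actual = new byte[expected.length];
        in.mark(actual.length);
        try {
            int filled = 0;
            while (filled < actual.length) {
                final int n = in.read(actual, filled, actual.length - filled);
                if (n < 0) {
                    return false; // stream ended before the name could be read
                }
                filled += n;
            }
            return Arrays.equals(expected, actual);
        } finally {
            in.reset(); // leave the stream where the caller had it
        }
    }
}
```

Because the name field comes first, the check only needs as many bytes as the expected name is long, and it never has to parse the rest of the header.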



[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/incubator-nifi/pull/27#discussion_r24956579
 
    --- Diff: nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/IdentifyMimeType.java ---
    @@ -327,148 +168,41 @@ public void process(final InputStream in) throws IOException {
             session.transfer(flowFile, REL_SUCCESS);
         }
     
    -    private static interface ContentScanningMimeTypeIdentifier {
    -
    -        boolean isEnabled(ProcessContext context);
    -
    -        String getMimeType(InputStream in) throws IOException;
    -    }
    -
    -    private static class ZipIdentifier implements ContentScanningMimeTypeIdentifier {
    -
    -        @Override
    -        public String getMimeType(final InputStream in) throws IOException {
    -            final ZipInputStream zipIn = new ZipInputStream(in);
    -            try {
    -                if (zipIn.getNextEntry() != null) {
    -                    return "application/zip";
    -                }
    -            } catch (final Exception e) {
    -            }
    -            return null;
    -        }
    -
    -        @Override
    -        public boolean isEnabled(final ProcessContext context) {
    -            return context.getProperty(IDENTIFY_ZIP).asBoolean();
    -        }
    -    }
    -
    -    private static class TarIdentifier implements ContentScanningMimeTypeIdentifier {
    -
    -        @Override
    -        public String getMimeType(final InputStream in) throws IOException {
    -            try (final TarArchiveInputStream tarIn = new TarArchiveInputStream(in)) {
    -                final TarArchiveEntry firstEntry = tarIn.getNextTarEntry();
    -                if (firstEntry != null) {
    -                    if (firstEntry.getName().equals(FlowFilePackagerV1.FILENAME_ATTRIBUTES)) {
    -                        final TarArchiveEntry secondEntry = tarIn.getNextTarEntry();
    -                        if (secondEntry != null && secondEntry.getName().equals(FlowFilePackagerV1.FILENAME_CONTENT)) {
    -                            return "application/flowfile-v1";
    -                        }
    -                    }
    -                    return "application/tar";
    -                }
    -            } catch (final Exception e) {
    -            }
    -            return null;
    -        }
    -
    -        @Override
    -        public boolean isEnabled(final ProcessContext context) {
    -            return context.getProperty(IDENTIFY_TAR).asBoolean();
    -        }
    +    private Detector getFlowFileV3Detector() {
    +        return new MagicDetector(FLOWFILE_V3, FlowFilePackagerV3.MAGIC_HEADER);
         }
     
    -    private static interface MagicHeader {
    -
    -        int getRequiredBufferLength();
    -
    -        String getMimeType();
    -
    -        boolean matches(final byte[] header);
    +    private Detector getFlowFileV1Detector() {
    +        return new FlowFileV1Detector();
         }
     
    -    private static class SimpleMagicHeader implements MagicHeader {
    -
    -        private final String mimeType;
    -        private final int offset;
    -        private final byte[] byteSequence;
    -
    -        public SimpleMagicHeader(final String mimeType, final byte[] byteSequence) {
    -            this(mimeType, byteSequence, 0);
    -        }
    -
    -        public SimpleMagicHeader(final String mimeType, final byte[] byteSequence, final int offset) {
    -            this.mimeType = mimeType;
    -            this.byteSequence = byteSequence;
    -            this.offset = offset;
    -        }
    -
    -        @Override
    -        public int getRequiredBufferLength() {
    -            return byteSequence.length + offset;
    -        }
    -
    -        @Override
    -        public String getMimeType() {
    -            return mimeType;
    -        }
    +    private class FlowFileV1Detector implements Detector {
     
             @Override
    -        public boolean matches(final byte[] header) {
    -            if (header.length < getRequiredBufferLength()) {
    -                return false;
    +        public MediaType detect(InputStream in, Metadata mtdt) throws IOException {
    +            // Sanity check the stream. This may not be a tarfile at all
    +            in.mark(FlowFilePackagerV1.FILENAME_ATTRIBUTES.length());
    +            byte[] bytes = new byte[FlowFilePackagerV1.FILENAME_ATTRIBUTES.length()];
    +            in.read(bytes);
    --- End diff --
   
    I would use "StreamUtils.fillBuffer(in, bytes, false)" here, to make sure that we get at least the number of bytes we need. I can't imagine that a call to read() in this case would read fewer than the 19 bytes that we need... but just to be sure :)
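
To illustrate the concern, here is a minimal sketch in plain Java of a read loop that keeps calling `read` until the buffer is full or the stream ends. It is not the NiFi `StreamUtils` implementation; the class and method names are hypothetical, and `TricklingStream` exists only to simulate a source that hands back one byte per call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch; not the NiFi StreamUtils implementation. */
final class FillBufferSketch {

    /** Reads until the buffer is full or the stream ends; returns the bytes read. */
    static int fill(final InputStream in, final byte[] buffer) throws IOException {
        int filled = 0;
        while (filled < buffer.length) {
            final int n = in.read(buffer, filled, buffer.length - filled);
            if (n < 0) {
                break; // end of stream reached before the buffer was full
            }
            filled += n;
        }
        return filled;
    }

    /** Test double: returns at most one byte per read call. */
    static final class TricklingStream extends ByteArrayInputStream {
        TricklingStream(final byte[] data) {
            super(data);
        }

        @Override
        public synchronized int read(final byte[] b, final int off, final int len) {
            return super.read(b, off, Math.min(len, 1));
        }
    }
}
```

A single `read(bytes)` on such a stream returns after one byte, while the loop keeps reading until all requested bytes have arrived or the stream ends.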



[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user adamonduty commented on the pull request:

    https://github.com/apache/incubator-nifi/pull/27#issuecomment-75210872
 
    Pushed a fix to both of those issues.



Re: [GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

Mark Payne
In reply to this post by JPercivall
Adam,


I definitely like the changes here. I reviewed the code, and I am happy with it.


The only thing here that is really giving me pause is the unexpectedly large size of the dependency. Pulling in Tika ends up bloating the standard nar from 12 MB to a whopping 37 MB. This isn’t the end of the world, but I am concerned about pulling this in because the deployment already is over 100 MB, and there have been some discussions before about the concern of the NiFi build becoming so bloated.


Are others ok with adding the 25 MB to the build for IdentifyMimeType, or does this give others pause as well?

From: adamonduty
Sent: Tuesday, February 17, 2015 1:56 PM
To: [hidden email]

GitHub user adamonduty opened a pull request:

    https://github.com/apache/incubator-nifi/pull/27

    NIFI-296: Extend capability of IdentifyMimeType

    ```
    This commit backs IdentifyMimeType with the Apache Tika library. Tika
    provides detailed mime type identification such as the ability to
    differentiate normal zip files from OOXML MS Office documents.
   
    The mime.type attribute continues to be set, though some mime types
    have changed due to Tika naming them differently. In addition,
    the mime.extension attribute is set to provide the commonly used
    extension for the mime type (if known).
    ```
   
    Some additional notes about this commit:
   
    I removed the IDENTIFY_ZIP and IDENTIFY_TAR properties. Keeping IDENTIFY_ZIP doesn't make sense because Tika is designed to identify container formats like zip files. Excluding zip files from detection would exclude a number of common mime types, which seems like undesirable behavior. IDENTIFY_TAR is in a similar situation.
   
    Also, in both cases, the previous code would "identify" a zip or tar file by attempting to open them with Zip and Tar readers. I believe Tika will use magic byte detection as a filtering mechanism to avoid applying deep inspection logic (ie opening the zip with a reader) when not necessary.
   
    It takes about 2 seconds to bring up the Tika detectors, which makes the tests run longer, but I believe the detection itself is roughly in the same performance category. The code shares a Tika config and list of detectors to minimize the performance impact related to bringing up detectors.
   
    I also replaced the test resource `1.tar` with a version created by a modern version of tar. The previous tar didn't use the ustar format (http://en.wikipedia.org/wiki/Tar_%28computing%29#UStar_format), which was standardized in 1988. Tika also couldn't identify the previous tar using magic byte detection.
   
    And finally, a few of the detected mime types changed names due to Tika naming them differently.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adamonduty/incubator-nifi NIFI-296-extend-IdentifyMimeType

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-nifi/pull/27.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #27
   
----
commit 16fb2b826c0cd983b5d905ceed7aff2a84383d33
Author: Adam Lamar <[hidden email]>
Date:   2015-02-14T20:57:41Z

    NIFI-296: Extend capability of IdentifyMimeType
   
    This commit backs IdentifyMimeType with the Apache Tika library. Tika
    provides detailed mime type identification such as the ability to
    differentiate normal zip files from OOXML MS Office documents.
   
    The mime.type attribute continues to be set, though some mime types
    have changed due to Tika naming them differently. In addition,
    the mime.extension attribute is set to provide the commonly used
    extension for the mime type (if known).

----



Re: [GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

Joe Witt
Mark / Adam,

The size of this thing does seem a bit problematic to me.  Perhaps we
can engage with the Tika team to see if they have recommendations on
how we can reduce it down.  We likely don't need its full horsepower.
But 37 MB is a ton and doing all the work for all the dependencies and
licensing accounting is a non-trivial tail as well.

Thanks
Joe

On Mon, Feb 23, 2015 at 3:17 PM, Mark Payne <[hidden email]> wrote:

> Adam,
>
>
> I definitely like the changes here. I reviewed the code, and I am happy with it.
>
>
> The only thing here that is really giving me pause is the unexpectedly large size of the dependency. Pulling in Tika ends up bloating the standard nar from 12 MB to a whopping 37 MB. This isn’t the end of the world, but I am concerned about pulling this in because the deployment already is over 100 MB, and there have been some discussions before about the concern of the NiFi build becoming so bloated.
>
>
> Are others ok with adding the 25 MB to the build for IdentifyMimeType, or does this give others pause as well?

[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user adamonduty commented on the pull request:

    https://github.com/apache/incubator-nifi/pull/27#issuecomment-75827104
 
    I agree - that is too much bloat. We could depend on tika-core only, but that provides much more basic functionality when it comes to container detection. Let me investigate a tika-core only solution and some alternatives and get back to you.



[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user adamonduty commented on the pull request:

    https://github.com/apache/incubator-nifi/pull/27#issuecomment-76627675
 
    The latest commits only depend upon tika-core, which brings the nifi-standard-nar to 13 MB (instead of 37).
   
    I added sample OOXML office docs from govdocs (http://digitalcorpora.org/corpora/govdocs), which should be redistributable.
   
    I also added tests for the mime.extension property.



[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user markap14 commented on the pull request:

    https://github.com/apache/incubator-nifi/pull/27#issuecomment-76731241
 
    Adam,
   
    Excellent job on this! I sent an e-mail to the digitalcorpora people via their 'contact us' page to ask for info on their license. I can't find anything listed. They indicate that it's free to download and distribute but it could be licensed under GPL or something like that, which is not OK. But if they indicate that it's a compatible license, then I will merge this into develop.
   
    This is a lot simpler than my terrible implementation and likely a lot more powerful!
   
    Many thanks for putting in all the hard work to get this going!
   
    -Mark



[GitHub] incubator-nifi pull request: NIFI-296: Extend capability of Identi...

JPercivall
In reply to this post by JPercivall
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-nifi/pull/27

