answered 2015-09-01 07:55:08 +0800

cvarona

Hi,

There is no such thing as a text file with no encoding. I guess you mean "when the file has an encoding other than UTF-8". In that case there is not much you can do, and what you can do is fairly clumsy, because ZK does not preserve the charset in the media object's content type. I'll post about this later in this forum, since I have similar problems.

Now, this is what you can do:

  1. Write a custom charset finder class and set it up in the system-config section of your zk.xml. Take a look at this and then implement the CharsetFinder interface. I use Apache Tika to perform the actual charset detection. If you use Maven 3.2.1 or later (which supports wildcard exclusions), you can use this dependency:

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.10</version>
        <exclusions>
            <exclusion>
                <groupId>*</groupId>
                <artifactId>*</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

and this code:

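import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;
import org.zkoss.zk.ui.util.CharsetFinder;

public class TikaCharsetDetector implements CharsetFinder {

    @Override
    public String getCharset( String contentType, InputStream stream ) throws IOException {

        CharsetDetector charsetDetector = new CharsetDetector();

        // Ignore markup-like tags when guessing the charset.
        charsetDetector.enableInputFilter( true );

        // setText() needs a stream that supports mark/reset, so wrap it if necessary.
        charsetDetector.setText( stream.markSupported() ?
                        stream :
                        new BufferedInputStream( stream )
        );

        // detect() returns null when nothing matches; returning null here
        // lets ZK fall back to its configured upload charset.
        CharsetMatch match = charsetDetector.detect();
        return match != null ? match.getName() : null;
    }
}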

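With this in place, the text within your media object should be decoded correctly as long as you read it with getStringData() or getReader(). Keep in mind that charset detection is a heuristic, so it can still guess wrong on short or ambiguous input.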
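
For the zk.xml registration mentioned in step 1, a fragment along these lines should work. The package name com.example is only a placeholder for wherever you put the class, so treat this as a sketch and check the element name against the ZK configuration reference for your version:

```xml
<zk>
    <system-config>
        <!-- com.example is a hypothetical package; use your own -->
        <upload-charset-finder-class>com.example.TikaCharsetDetector</upload-charset-finder-class>
    </system-config>
</zk>
```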

  2. If for whatever reason you cannot read the media object's content as characters and are forced to read it as raw bytes, you need to know which charset was actually used when constructing it. Neither AMedia nor ReaderMedia exposes this information. The former strips the charset from the content type, for reasons completely unknown to me; the latter holds the charset internally but does not make it publicly accessible. The only solution involves writing a custom AuUploader that creates instances of wrapper classes carrying this information. This section will tell you how to write and configure your custom AuUploader. As for how to customize it, write AMedia and ReaderMedia wrappers that accept and expose the charset, and then use them in this piece of code from the processItem method:

    } else if (ctypelc.startsWith("text/")) {
        String charset = getCharset(ctype);
        if (charset == null) {
            final Configuration conf = desktop.getWebApp().getConfiguration();
            final CharsetFinder chfd = conf.getUploadCharsetFinder();
            if (chfd != null)
                charset = chfd.getCharset(ctype,
                    fi.isInMemory() ? new ByteArrayInputStream(fi.get()) : fi.getInputStream());
            if (charset == null)
                charset = conf.getUploadCharset();
        }
        return fi.isInMemory()
            ? new AMedia(name, null, ctype, fi.getString(charset))
            : new ReaderMedia(name, null, ctype, fi, charset);
    }
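
To illustrate why the charset has to travel with the raw bytes, here is a minimal standalone sketch that uses only the JDK. CharsetAwareContent is a hypothetical name and is not part of ZK; a real wrapper would extend or delegate to AMedia or ReaderMedia and be created in your custom AuUploader in place of the two constructors above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder: keeps raw bytes together with the charset needed to decode them. */
public final class CharsetAwareContent {

    private final byte[] bytes;
    private final Charset charset;

    public CharsetAwareContent(byte[] bytes, String charsetName) {
        this.bytes = bytes.clone();
        // Throws an unchecked exception if the charset name is illegal or unsupported.
        this.charset = Charset.forName(charsetName);
    }

    /** Raw bytes, exactly as they were handed in. */
    public byte[] getByteData() {
        return bytes.clone();
    }

    /** The piece of information the stock media classes do not expose. */
    public String getCharsetName() {
        return charset.name();
    }

    /** Decodes the raw bytes with the charset that was recorded alongside them. */
    public String getStringData() {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as ISO-8859-1: the last character is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        CharsetAwareContent content = new CharsetAwareContent(latin1, "ISO-8859-1");

        // Decoding with the recorded charset round-trips the text.
        System.out.println(content.getStringData().equals("caf\u00e9"));

        // Decoding the same bytes as UTF-8 does not, which is the problem described above.
        System.out.println(new String(content.getByteData(), StandardCharsets.UTF_8).equals("caf\u00e9"));
    }
}
```

Whoever ends up with the raw bytes can then call getCharsetName() instead of guessing.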

Hope it helps.
