
Button upload loses charset

asked 2015-07-06 09:13:52 +0800

xcapdevila

updated 2015-07-06 09:14:23 +0800

Hi all,

I'm using ZK 7 and I can't get the encoding right for an uploaded CSV file.

When I upload the file, special characters come through as ?. The default encoding is set to UTF-8, and everything works fine when the file is actually encoded that way. But when the file has no encoding, I can't force it to UTF-8, because by the time I get the stream or byte array the ? characters are already there.

I've tried taking it as text:

<z:button label="${msg.backend.actions.upload}" upload="true,maxsize=-1,multiple=false"
                    onUpload="@command('upload')" />

and as native:

<z:button label="${msg.backend.actions.upload}" upload="true,maxsize=-1,multiple=false,native"
                    onUpload="@command('upload')" />

but I always get the same result: ? on special characters.

Java code:

byte[] in = media.getByteData();

// Attempt 1: decode the raw bytes directly as UTF-8
String utf8Data = new String(in, Charsets.UTF_8);

// Attempt 2: decode through a reader, again as UTF-8
StringBuilder sb = new StringBuilder();
char[] buffer = new char[4096];
InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(in), Charsets.UTF_8);
int charsRead;
while ((charsRead = reader.read(buffer)) != -1) {
 sb.append(buffer, 0, charsRead);
}
String utf8Data2 = new String(sb.toString().getBytes(Charsets.UTF_8));

Any ideas?

Many thanks.


2 Answers


answered 2015-07-06 10:29:56 +0800

Darksu

Hello xcapdevila,

Encoding is always a big issue.

Since you do not know which encoding the file uses, you could try to detect it first and then read the file with the detected charset:

http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream

https://code.google.com/p/juniversalchardet/
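As a rough illustration of the "detect first, then read" idea, here is a minimal sketch that uses only the JDK. It is far cruder than the libraries linked above: it only checks whether the bytes are valid UTF-8 and otherwise falls back to a charset you choose. The class name UploadCharsetGuesser and the choice of fallback are assumptions made for this example, not part of ZK or of any detection library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a ZK class): picks UTF-8 or a caller-supplied fallback. */
final class UploadCharsetGuesser {

    private UploadCharsetGuesser() {
    }

    /** Returns UTF-8 if the bytes are well-formed UTF-8, otherwise the fallback. */
    static Charset guess(byte[] data, Charset fallback) {
        // A strict decoder reports malformed input instead of silently replacing it
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strictUtf8.decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    /** Decodes the bytes with whichever charset guess() selected. */
    static String decode(byte[] data, Charset fallback) {
        return new String(data, guess(data, fallback));
    }
}
```

With the raw upload bytes in hand, a call such as UploadCharsetGuesser.decode(bytes, StandardCharsets.ISO_8859_1) would yield the text. Note that this only helps if the bytes you receive are still the original ones; a real detector is the better choice when more than two candidate encodings are possible.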

I have a few more ideas, so please let me know how it goes. Also, is there any chance you could upload the test files?

Best Regards,

Darksu


answered 2015-09-01 07:55:08 +0800

cvarona

updated 2015-09-01 07:56:22 +0800

Hi,

There is no such thing as a text file with no encoding; I guess you mean "when the file has an encoding other than UTF-8". In that case there is not much you can do, and what you can do is fairly clumsy, because ZK does not preserve the charset in the media object's content type. I'll post about this later in this forum, since I have similar problems.
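To make the point concrete, here is a small JDK-only sketch of what probably happens to a non-UTF-8 file that gets decoded as UTF-8: the bytes that are not valid UTF-8 are replaced by U+FFFD (often displayed as ?), and re-encoding the resulting text cannot bring the original bytes back. The class and method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical demo class: shows that a wrong decode is lossy. */
final class MisdecodeDemo {

    private MisdecodeDemo() {
    }

    /** Decodes bytes as UTF-8; malformed sequences are replaced by U+FFFD. */
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** True if decoding as UTF-8 and re-encoding gives back the original bytes. */
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        return Arrays.equals(raw, decodeAsUtf8(raw).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The same text, stored as ISO-8859-1 instead of UTF-8
        byte[] latin1 = "ni\u00f1o".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the wrong charset introduces the replacement character
        System.out.println(decodeAsUtf8(latin1).indexOf('\uFFFD') >= 0);

        // The damage cannot be undone by converting the String afterwards
        System.out.println(survivesUtf8RoundTrip(latin1));
    }
}
```

This is why converting the String to UTF-8 after the fact does not help: the charset has to be right at the moment the bytes are first turned into characters.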

Now, this is what you can do:

  1. Write a custom charset finder class and register it in the system-config section of your zk.xml. Take a look at the CharsetFinder interface and then implement it. I use Apache Tika to perform the actual charset detection. If you use Maven 3.2.1 or greater, you can use this dependency (the wildcard exclusion keeps Tika's transitive dependencies out):

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.10</version>
        <exclusions>
            <exclusion>
                <groupId>*</groupId>
                <artifactId>*</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

and this code:

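import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;
import org.zkoss.zk.ui.util.CharsetFinder;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TikaCharsetDetector implements CharsetFinder {

    public String getCharset( String contentType, InputStream stream ) throws IOException {

        CharsetDetector charsetDetector = new CharsetDetector();
        // The detector needs a stream that supports mark/reset, hence the buffering
        charsetDetector.setText( stream instanceof BufferedInputStream ?
                        stream :
                        new BufferedInputStream( stream )
        );

        // Ignore markup tags when sampling the content
        charsetDetector.enableInputFilter( true );

        // detect() returns null when nothing matches; returning null here
        // lets ZK fall back to its configured upload charset
        CharsetMatch match = charsetDetector.detect();
        return match != null ? match.getName() : null;
    }
}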

This will ensure the text within your media object is correct as long as you read with getStringData() or getReader().
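For reference, registering the finder in zk.xml could look roughly like the fragment below. The element name upload-charset-finder-class is given as I recall it from the ZK Configuration Reference, and the package com.example is a placeholder, so check both against your ZK version and your own project.

```xml
<zk>
    <system-config>
        <!-- Fully qualified name of your CharsetFinder implementation (package is a placeholder) -->
        <upload-charset-finder-class>com.example.TikaCharsetDetector</upload-charset-finder-class>
    </system-config>
</zk>
```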

  2. If for whatever reason you cannot read the media object's content as characters and have to read it as raw bytes, you need to know which charset was actually used when the media object was constructed. Neither AMedia nor ReaderMedia exposes this information: the former strips the charset from the content type for reasons completely unknown to me, and the latter holds the charset internally but provides no public accessor for it. The only solution is to write a custom AuUploader that creates instances of wrapper classes carrying this information. This section will tell you how to write and configure your custom AuUploader. As for the customization itself, write AMedia and ReaderMedia wrappers that accept and expose the charset, and then use them in this piece of the processItem method:

    } else if (ctypelc.startsWith("text/")) {
        String charset = getCharset(ctype);
        if (charset == null) {
            final Configuration conf = desktop.getWebApp().getConfiguration();
            final CharsetFinder chfd = conf.getUploadCharsetFinder();
            if (chfd != null)
                charset = chfd.getCharset(ctype,
                    fi.isInMemory() ?
                        new ByteArrayInputStream(fi.get()) :
                        fi.getInputStream());
            if (charset == null)
                charset = conf.getUploadCharset();
        }
        return fi.isInMemory() ?
            new AMedia(name, null, ctype, fi.getString(charset)) :
            new ReaderMedia(name, null, ctype, fi, charset);
    }
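The wrapper idea itself is independent of ZK, so here is a JDK-only sketch of it: a small class that keeps the raw bytes together with the charset they were written in, so that whoever reads the bytes later can still decode them correctly. CharsetTaggedBytes is a hypothetical stand-in, not a ZK type; in a real application the same fields and accessors would live in your AMedia and ReaderMedia wrappers.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical value class (not a ZK type): raw upload bytes plus their charset. */
final class CharsetTaggedBytes {

    private final byte[] data;
    private final Charset charset;

    /** charsetName is whatever the upload code resolved, e.g. via a charset finder. */
    CharsetTaggedBytes(byte[] data, String charsetName) {
        this.data = data.clone();
        this.charset = Charset.forName(charsetName);
    }

    /** The charset the bytes are encoded in; this is what the plain media objects hide. */
    Charset getCharset() {
        return charset;
    }

    /** A copy of the raw bytes, untouched. */
    byte[] getByteData() {
        return data.clone();
    }

    /** The bytes decoded with their own charset. */
    String getStringData() {
        return new String(data, charset);
    }

    /** The same text re-encoded as UTF-8, for code that expects UTF-8 bytes. */
    byte[] toUtf8() {
        return getStringData().getBytes(StandardCharsets.UTF_8);
    }
}
```

The point of the design is simply that the charset travels with the bytes, so a later new String(bytes, charset) never has to guess.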

Hope it helps.

[Sorry, it looks like this forum only allows one preformatted block per answer]
