Hi,
UTF-encoded files sometimes start with a byte order mark (BOM). It is at most 4 bytes long and simply identifies the kind of UTF encoding used. Apparently the HXTT CSV driver does not take this BOM into account when opening a file, which leads to problems when performing queries. Could you please add support for these types of UTF files? For your convenience I have included a piece of code that shows how the identification and skipping can be done. It detects the BOM, skips it if one is found, and then returns an InputStreamReader that is positioned just after the BOM.
Thanks and best regards,
Frank
/**
* Read ahead four bytes and check for a BOM mark. Extra bytes are unread back
* to the stream; only the BOM bytes are skipped.
*
* @param oInputStream the original input stream.
* @param sEncoding the given encoding.
* @return an InputStreamReader positioned just after any BOM that was found.
* The encoding passed to the input stream reader is adjusted if necessary.
*/
public static InputStreamReader checkBOMAndAdjustEncoding(InputStream oInputStream, String sEncoding) throws IOException {
final int BOM_SIZE = 4;
PushbackInputStream oPushbackStream = new PushbackInputStream(oInputStream, BOM_SIZE);
byte[] aBOM = new byte[BOM_SIZE];
int nBytesRead = oPushbackStream.read(aBOM, 0, aBOM.length);
int nBytesToUnread;
// Check for BOM
if ((aBOM[0] == (byte) 0xEF) && (aBOM[1] == (byte) 0xBB) && (aBOM[2] == (byte) 0xBF)) {
sEncoding = "UTF-8";
nBytesToUnread = nBytesRead - 3;
}
else if ((aBOM[0] == (byte) 0xFE) && (aBOM[1] == (byte) 0xFF)) {
sEncoding = "UTF-16BE";
nBytesToUnread = nBytesRead - 2;
}
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE)) {
sEncoding = "UTF-16LE";
nBytesToUnread = nBytesRead - 2;
}
else if ((aBOM[0] == (byte) 0x00) && (aBOM[1] == (byte) 0x00) &&
(aBOM[2] == (byte) 0xFE) && (aBOM[3] == (byte) 0xFF)) {
sEncoding = "UTF-32BE";
nBytesToUnread = nBytesRead - 4;
}
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte)0xFE) &&
(aBOM[2] == (byte) 0x00) && (aBOM[3] == (byte) 0x00)) {
sEncoding = "UTF-32LE";
nBytesToUnread = nBytesRead - 4;
}
else {
// Unicode BOM mark not found, unread all bytes.
nBytesToUnread = nBytesRead;
}
if (nBytesToUnread > 0) oPushbackStream.unread(aBOM, (nBytesRead - nBytesToUnread), nBytesToUnread);
else if (nBytesToUnread < -1) oPushbackStream.unread(aBOM, 0, 0);
// Use calculated encoding.
return new InputStreamReader(oPushbackStream, sEncoding);
}
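For illustration, here is a rough usage sketch (it assumes the method above lives in the same class and additionally needs java.io.BufferedReader and java.io.FileInputStream); the file name argument and the ISO-8859-1 fallback encoding are placeholders only:
// Rough usage sketch: dump a text file, letting the BOM (if any) pick the encoding.
public static void dumpTextFile(String sFileName) throws IOException {
    BufferedReader oReader = new BufferedReader(checkBOMAndAdjustEncoding(
            new FileInputStream(sFileName), "ISO-8859-1"));
    try {
        String sLine;
        while ((sLine = oReader.readLine()) != null) {
            System.out.println(sLine);
        }
    } finally {
        oReader.close();
    }
}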
|
Please upload a data file sample :)
ftp site: ftp.hxtt.com
ftp user: anonymous@hxtt.com
ftp password: (empty)
login mode: normal (not anonymous)
ftp port: 21
upload directory: incoming
After uploading you won't be able to see the uploaded file, but it will have been uploaded.
Then notify us at webmaster@hxtt.com.
BTW, what is your data file type (text, csv, psv, or tsv)?
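If you would rather script the upload than use an FTP client, a rough sketch with the JDK's built-in ftp: URL handler could look like the following (any ordinary FTP client does the same job; the local file name is a placeholder):
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public class FtpUploadSketch {
    public static void main(String[] args) throws Exception {
        String sLocalFile = "UTF8_BOM_SAMPLE.txt";  // placeholder local file
        // User anonymous@hxtt.com with an empty password; the '@' in the user
        // name is percent-encoded as %40, and ";type=i" requests binary transfer.
        URL oUrl = new URL("ftp://anonymous%40hxtt.com:@ftp.hxtt.com:21/incoming/"
                + sLocalFile + ";type=i");
        URLConnection oConnection = oUrl.openConnection();
        oConnection.setDoOutput(true);
        InputStream oIn = new FileInputStream(sLocalFile);
        OutputStream oOut = oConnection.getOutputStream();
        byte[] aBuffer = new byte[8192];
        int nRead;
        while ((nRead = oIn.read(aBuffer)) != -1) {
            oOut.write(aBuffer, 0, nRead);
        }
        oOut.close();
        oIn.close();
    }
}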
|
The file was uploaded. It is in UTF-8 format with a BOM signature; the contents are tab-delimited, without quotes around the strings.
Frank
|
Thanks. If you have a UTF-16BE sample or any other samples, you can upload them too :)
We will add that support soon :)
|
Thanks again for the quick response. As I mentioned in the mail I sent to webmaster@hxtt.com, you can download a free version of EmEditor (http://www.emeditor.com/download.htm) to experiment with these BOM signatures. I think I covered all of the possibilities in my code snippet.
Best regards,
Frank
|
Just a hint, there's a bug in your function for UTF-16LE and UTF-32LE:
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE)) {
sEncoding = "UTF-16LE";
nBytesToUnread = nBytesRead - 2;
}
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte)0xFE) &&
(aBOM[2] == (byte) 0x00) && (aBOM[3] == (byte) 0x00)) { // this branch can never be reached
sEncoding = "UTF-32LE";
nBytesToUnread = nBytesRead - 4;
}
It should be:
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte)0xFE) &&
(aBOM[2] == (byte) 0x00) && (aBOM[3] == (byte) 0x00)) { // now this branch can be reached
sEncoding = "UTF-32LE";
nBytesToUnread = nBytesRead - 4;
}
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE)) {
sEncoding = "UTF-16LE";
nBytesToUnread = nBytesRead - 2;
}
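The root cause is that the UTF-32LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16LE BOM (FF FE), so the shorter two-byte test always matches first and the four-byte test becomes unreachable; the longer pattern has to be checked before the shorter one. A quick way to convince yourself, assuming the corrected method is kept in a class named, say, BomUtil (hypothetical name):
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class BomOrderCheck {
    public static void main(String[] args) throws IOException {
        // "\uFEFF" is the BOM character; encoded as UTF-32LE it becomes FF FE 00 00.
        byte[] aBytes = "\uFEFFOK".getBytes("UTF-32LE");
        // BomUtil is a hypothetical holder for the corrected checkBOMAndAdjustEncoding.
        BufferedReader oReader = new BufferedReader(BomUtil.checkBOMAndAdjustEncoding(
                new ByteArrayInputStream(aBytes), "ISO-8859-1"));
        // With the corrected ordering this prints "OK"; with the original ordering the
        // bytes would be decoded as UTF-16LE and the output would contain NUL characters.
        System.out.println(oReader.readLine());
        oReader.close();
    }
}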
|
v3.0.019 automatically detects and uses the byte order mark (BOM) in UTF files for CSV, TSV, and PSV.
You can use:
url="jdbc:text:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=-1"
Or
url="jdbc:text:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=0"
Then:
select * from UTF8_BOM_SAMPLE.txt
or
select * from UTF8_BOM_SAMPLE
Please download the latest JDBC30 package. The JDBC20 and JDBC12 packages will be available in 2 hours.
|
The preferred URL is:
url="jdbc:csv:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=-1"
Or
url="jdbc:csv:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=0"
|
Thanks for the hint; this was indeed a bug :)
Always nice to help each other out...
Frank
|