Hongxin Technology & Trade Ltd. of Xiangtan City (abbr, HXTT)

HXTT Text(CSV)
UTF BOM marker
Frank
2006-09-17 02:51:06.0
Hi,

UTF files sometimes contain a marker at the beginning of the file called a byte-order mark (BOM). It is at most 4 bytes long and identifies the type of UTF file. Apparently the HXTT CSV driver doesn't take the BOM into account when opening the file, which leads to problems when performing queries. Could you please add support for these types of UTF files? For your convenience, I have added a piece of code that shows how to do the identification and skipping. It identifies the BOM, skips it if one is found, and returns an InputStreamReader that is positioned just after the BOM.

Thanks and best regards,
Frank

/**
 * Read-ahead four bytes and check for BOM marks. Extra bytes are unread back
 * to the stream, only BOM bytes are skipped.
 *
 * @param oInputStream the original input stream.
 * @param sEncoding the given encoding.
 * @return InputStreamReader an input stream reader after scanning for BOM and special UTF encoding schemes.
 * The encoding passed to the input stream reader is adjusted if necessary.
 */
public static InputStreamReader checkBOMAndAdjustEncoding(InputStream oInputStream, String sEncoding) throws IOException {
    final int BOM_SIZE = 4;
    PushbackInputStream oPushbackStream = null;
    byte aBOM[];
    int nBytesRead;
    int nBytesToUnread;

    oPushbackStream = new PushbackInputStream(oInputStream, BOM_SIZE);
    aBOM = new byte[BOM_SIZE];
    nBytesRead = oPushbackStream.read(aBOM, 0, aBOM.length);
    // Check for BOM
    if ((aBOM[0] == (byte) 0xEF) && (aBOM[1] == (byte) 0xBB) && (aBOM[2] == (byte) 0xBF)) {
        sEncoding = "UTF-8";
        nBytesToUnread = nBytesRead - 3;
    }
    else if ((aBOM[0] == (byte) 0xFE) && (aBOM[1] == (byte) 0xFF)) {
        sEncoding = "UTF-16BE";
        nBytesToUnread = nBytesRead - 2;
    }
    else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE)) {
        sEncoding = "UTF-16LE";
        nBytesToUnread = nBytesRead - 2;
    }
    else if ((aBOM[0] == (byte) 0x00) && (aBOM[1] == (byte) 0x00) &&
             (aBOM[2] == (byte) 0xFE) && (aBOM[3] == (byte) 0xFF)) {
        sEncoding = "UTF-32BE";
        nBytesToUnread = nBytesRead - 4;
    }
    else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE) &&
             (aBOM[2] == (byte) 0x00) && (aBOM[3] == (byte) 0x00)) {
        sEncoding = "UTF-32LE";
        nBytesToUnread = nBytesRead - 4;
    }
    else {
        // Unicode BOM mark not found, unread all bytes.
        nBytesToUnread = nBytesRead;
    }
    if (nBytesToUnread > 0) oPushbackStream.unread(aBOM, (nBytesRead - nBytesToUnread), nBytesToUnread);
    else if (nBytesToUnread < -1) oPushbackStream.unread(aBOM, 0, 0);
    // Use calculated encoding.
    return new InputStreamReader(oPushbackStream, sEncoding);
}
Re:UTF BOM marker
HXTT Support
2006-09-17 03:02:51.0
Please upload a sample data file :)
ftp site: ftp.hxtt.com
ftp user: anonymous@hxtt.com
ftp password: (empty)
login mode: normal (not anonymous)
ftp port:21
upload directory: incoming
After uploading, you won't be able to see the uploaded file, but it has been uploaded.

then notify us through webmaster@hxtt.com .

BTW, what's your data file type (text, csv, psv, or tsv)?
Re:Re:UTF BOM marker
Frank
2006-09-17 03:09:42.0
The file has been uploaded. It is in UTF-8 format with a BOM signature; the contents are tab-delimited, without quotes around the strings.

Frank
Re:Re:Re:UTF BOM marker
HXTT Support
2006-09-17 03:18:28.0
Thanks. If you have a UTF-16BE sample or other samples, you can upload them too :)
We will add that support soon :)
Re:Re:Re:Re:UTF BOM marker
Frank
2006-09-17 03:24:27.0
Thanks again for the quick response. As I said in the mail I sent to webmaster@hxtt.com, you can download a free version of EmEditor (http://www.emeditor.com/download.htm) to experiment with these signatures. I think my code snippet covers all of the possibilities.

Best regards,
Frank
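For readers who prefer not to use an editor, sample files with each signature can also be generated programmatically. The following is a minimal sketch only: the class name BomSampleWriter, its methods, and the sample row contents are hypothetical and not part of the original post. The file naming merely mirrors the UTF8_BOM_SAMPLE.txt name used later in this thread. It assumes the JVM provides the UTF-32BE and UTF-32LE charsets, which common JDKs do but the Java specification does not require.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomSampleWriter {
    /** Returns the BOM bytes for the given encoding name. */
    static byte[] bomFor(String encoding) {
        switch (encoding) {
            case "UTF-8":    return new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
            case "UTF-16BE": return new byte[] {(byte) 0xFE, (byte) 0xFF};
            case "UTF-16LE": return new byte[] {(byte) 0xFF, (byte) 0xFE};
            case "UTF-32BE": return new byte[] {0x00, 0x00, (byte) 0xFE, (byte) 0xFF};
            case "UTF-32LE": return new byte[] {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
            default: throw new IllegalArgumentException("No BOM known for " + encoding);
        }
    }

    /** Encodes the text and prepends the matching BOM. */
    static byte[] withBom(String text, String encoding) {
        byte[] bom = bomFor(encoding);
        // The endian-specific charsets used here do not emit a BOM themselves.
        byte[] body = text.getBytes(Charset.forName(encoding));
        byte[] out = new byte[bom.length + body.length];
        System.arraycopy(bom, 0, out, 0, bom.length);
        System.arraycopy(body, 0, out, bom.length, body.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Target directory is taken from the first argument (hypothetical convention).
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String sample = "ID\tNAME\n1\tFrank\n"; // made-up tab-delimited rows
        for (String enc : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE"}) {
            Path file = dir.resolve(enc.replace("-", "") + "_BOM_SAMPLE.txt");
            Files.write(file, withBom(sample, enc));
        }
    }
}
```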
Re:Re:Re:Re:Re:UTF BOM marker
HXTT Support
2006-09-17 04:16:15.0
Just a hint: there's a bug in your function for UTF-16LE and UTF-32LE. The UTF-32LE BOM starts with the same two bytes (FF FE) as the UTF-16LE BOM, so the UTF-16LE test always matches first:
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE)) {
    sEncoding = "UTF-16LE";
    nBytesToUnread = nBytesRead - 2;
}
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE) &&
         (aBOM[2] == (byte) 0x00) && (aBOM[3] == (byte) 0x00)) { // Never reached
    sEncoding = "UTF-32LE";
    nBytesToUnread = nBytesRead - 4;
}

It should be:
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE) &&
         (aBOM[2] == (byte) 0x00) && (aBOM[3] == (byte) 0x00)) { // Now reachable
    sEncoding = "UTF-32LE";
    nBytesToUnread = nBytesRead - 4;
}
else if ((aBOM[0] == (byte) 0xFF) && (aBOM[1] == (byte) 0xFE)) {
    sEncoding = "UTF-16LE";
    nBytesToUnread = nBytesRead - 2;
}
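Putting the reordering together, a self-contained version of the helper might look like the sketch below. It is not the driver's implementation. The class name BomSniffer and the method names are hypothetical. Beyond the fix discussed here, it also assumes two extra precautions that the thread does not mention: it only tests a signature when enough bytes were actually read, and it loops on read() because a single call may return fewer bytes than requested. It also assumes the JVM provides the UTF-32BE and UTF-32LE charsets.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;

public class BomSniffer {
    private static final int BOM_SIZE = 4;

    /** Returns a reader positioned just after any BOM; uses defaultEncoding if none is found. */
    public static InputStreamReader open(InputStream in, String defaultEncoding) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, BOM_SIZE);
        byte[] head = new byte[BOM_SIZE];
        int n = 0;
        // Fill the buffer until it is full or the stream ends.
        while (n < BOM_SIZE) {
            int r = pushback.read(head, n, BOM_SIZE - n);
            if (r < 0) break;
            n += r;
        }

        String encoding = defaultEncoding;
        int bomLength = 0;
        // The 4-byte signatures are tested before the 2-byte ones,
        // because the UTF-32LE BOM begins with the UTF-16LE BOM.
        if (startsWith(head, n, 0x00, 0x00, 0xFE, 0xFF)) {
            encoding = "UTF-32BE"; bomLength = 4;
        } else if (startsWith(head, n, 0xFF, 0xFE, 0x00, 0x00)) {
            encoding = "UTF-32LE"; bomLength = 4;
        } else if (startsWith(head, n, 0xEF, 0xBB, 0xBF)) {
            encoding = "UTF-8"; bomLength = 3;
        } else if (startsWith(head, n, 0xFE, 0xFF)) {
            encoding = "UTF-16BE"; bomLength = 2;
        } else if (startsWith(head, n, 0xFF, 0xFE)) {
            encoding = "UTF-16LE"; bomLength = 2;
        }

        // Push back everything that was read beyond the BOM.
        if (n > bomLength) {
            pushback.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pushback, encoding);
    }

    /** True if at least signature.length bytes are available and they match. */
    private static boolean startsWith(byte[] buf, int available, int... signature) {
        if (available < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((buf[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

One ambiguity remains in any such check: a UTF-16LE file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32LE BOM, so this sketch would treat it as UTF-32LE.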
Re:Re:Re:Re:Re:Re:UTF BOM marker
HXTT Support
2006-09-17 06:36:04.0
v3.0.019 now automatically detects and uses the byte-order mark (BOM) in UTF files for CSV, TSV, and PSV.
You can use:
url="jdbc:text:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=-1"
Or
url="jdbc:text:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=0"

Then:
select * from UTF8_BOM_SAMPLE.txt
or
select * from UTF8_BOM_SAMPLE

Please download the latest JDBC30 package. The JDBC20 and JDBC12 packages will be available in 2 hours.
Re:Re:Re:Re:Re:Re:Re:UTF BOM marker
HXTT Support
2006-09-17 06:38:11.0
The preferred URL is:
url="jdbc:csv:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=-1"
Or
url="jdbc:csv:////f:/textfiles?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows=0"
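As a rough illustration of how such a URL and query might be used from Java, here is a minimal sketch. It assumes the HXTT Text (CSV) driver jar is on the classpath and registers itself with DriverManager; the class name BomCsvQuery and its methods are hypothetical, and only the URL properties and table name come from the posts in this thread.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BomCsvQuery {
    /**
     * Builds the jdbc:csv URL quoted in this thread.
     * directory is the part after "jdbc:csv:", e.g. "////f:/textfiles".
     */
    static String buildUrl(String directory, int maxScanRows) {
        // Written as in the post; inside a Java string literal, \t is a tab character.
        return "jdbc:csv:" + directory
                + "?_CSV_Separator=\t;_CSV_Header=true;csvfileExtension=TXT;_CSV_Quoter=;maxScanRows="
                + maxScanRows;
    }

    /** Runs "select * from <table>" and returns the number of rows read. */
    static int countRows(String url, String table) throws SQLException {
        // Assumption: a driver that accepts this URL is already registered.
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from " + table)) {
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            return rows;
        }
    }
}
```

With the driver available, a call such as countRows(buildUrl("////f:/textfiles", -1), "UTF8_BOM_SAMPLE") would run the query shown earlier in this thread.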
Re:Re:Re:Re:Re:Re:Re:Re:UTF BOM marker
Frank
2006-09-17 15:00:03.0
Thanks for the hint, this was indeed a bug:)

Always nice to help each other out...

Frank

Address: 9 Station Rd., Xiangtan City, Hunan Province, P.R. China
Postcode: 411100
Phone: (86)731-58225727
Fax: (86)731-58225727
Email: webmaster@hxtt.com
Copyright © 1999-2011 Hongxin Technology & Trade Ltd. | All Rights Reserved. | Privacy | Legal | Sitemap