Charset (Java Platform SE 6)

If a charset listed in the IANA Charset Registry is supported by an implementation of the Java platform then its canonical name must be the name listed in the registry. Many charsets are given more than one name in the registry, in which case the registry identifies one of the names as MIME-preferred. If a charset has more than one registry name then its canonical name must be the MIME-preferred name and the other names in the registry must be valid aliases. If a supported charset is not listed in the IANA registry then its canonical name must begin with one of the strings "X-" or "x-".
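The naming rules above can be observed through the class's own accessors. The following is a minimal sketch, not part of the original documentation; the class name CanonicalNameDemo and the helper aliasesResolve are hypothetical names chosen for illustration, and the set of aliases printed depends on the implementation.

```java
import java.nio.charset.Charset;

public class CanonicalNameDemo {

    // Hypothetical helper: checks that every alias reported by the charset
    // can be passed to forName and resolves back to the same charset.
    static boolean aliasesResolve(Charset cs) {
        for (String alias : cs.aliases()) {
            if (!Charset.forName(alias).equals(cs)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        // The canonical name is the MIME-preferred registry name.
        System.out.println("canonical name: " + latin1.name());
        // The alias set varies between implementations.
        System.out.println("aliases: " + latin1.aliases());
        System.out.println("registered with IANA: " + latin1.isRegistered());
        System.out.println("all aliases resolve: " + aliasesResolve(latin1));
    }
}
```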

The IANA charset registry does change over time, and so the canonical name and the aliases of a particular charset may also change over time. To ensure compatibility it is recommended that no alias ever be removed from a charset, and that if the canonical name of a charset is changed then its previous canonical name be made into an alias.

Standard charsets

Every implementation of the Java platform is required to support the following standard charsets. Consult the release documentation for your implementation to see if any other charsets are supported. The behavior of such optional charsets may differ between implementations.

Charset      Description

US-ASCII     Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1   ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8        Eight-bit UCS Transformation Format
UTF-16BE     Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE     Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16       Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
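As an illustration, the six required names can be checked with isSupported, and the optional charsets of a given implementation can be listed with availableCharsets. This is a sketch only; the class name StandardCharsetsDemo and its members are hypothetical.

```java
import java.nio.charset.Charset;

public class StandardCharsetsDemo {

    // The six charset names that every implementation must support.
    static final String[] STANDARD = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    // Returns true if every standard charset name is reported as supported.
    static boolean allSupported() {
        for (String name : STANDARD) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("all standard charsets supported: " + allSupported());
        // The total count includes optional charsets and differs between implementations.
        System.out.println("charsets available here: " + Charset.availableCharsets().size());
    }
}
```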

The UTF-8 charset is specified by RFC 2279; the transformation format upon which it is based is specified in Amendment 2 of ISO 10646-1 and is also described in the Unicode Standard.

The UTF-16 charsets are specified by RFC 2781; the transformation formats upon which they are based are specified in Amendment 1 of ISO 10646-1 and are also described in the Unicode Standard.

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:

  • When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.

  • When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.

In any case, when a byte-order mark is read at the beginning of a decoding operation it is omitted from the resulting sequence of characters. Byte-order marks occurring after the first element of an input sequence are not omitted since the same code is used to represent ZERO-WIDTH NON-BREAKING SPACE.
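The encoding side of these rules, and the way the UTF-16 charset decodes input with and without a byte-order mark, can be sketched as follows. The class name ByteOrderMarkDemo and its helpers are hypothetical; the example deliberately does not decode a byte-order mark with UTF-16BE or UTF-16LE.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class ByteOrderMarkDemo {

    // Encodes text with the named charset.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Decodes bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // UTF-16 writes a big-endian byte-order mark (FE FF) before the data.
        System.out.println(Arrays.toString(encode("A", "UTF-16")));   // [-2, -1, 0, 65]
        // UTF-16BE and UTF-16LE write no byte-order mark.
        System.out.println(Arrays.toString(encode("A", "UTF-16BE"))); // [0, 65]
        System.out.println(Arrays.toString(encode("A", "UTF-16LE"))); // [65, 0]

        // A little-endian byte-order mark (FF FE) tells the UTF-16 decoder the
        // byte order; the mark itself is omitted from the decoded characters.
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(decode(littleEndianWithBom, "UTF-16"));    // A

        // Without a byte-order mark the UTF-16 decoder assumes big-endian.
        byte[] noBom = {0x00, 0x41};
        System.out.println(decode(noBom, "UTF-16"));                  // A
    }
}
```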

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
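Because the default charset depends on the environment, code that must produce the same bytes everywhere usually names a charset explicitly. A brief sketch, in which the class name DefaultCharsetDemo and the helper toUtf8 are hypothetical:

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {

    // Encodes text as UTF-8 regardless of the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        // The value printed here varies with the operating system and locale.
        System.out.println("default charset: " + Charset.defaultCharset().name());

        String text = "caf\u00e9";
        // getBytes() with no argument uses the default charset, so its length may vary.
        System.out.println("bytes in default charset: " + text.getBytes().length);
        // Naming the charset gives a stable result: 'e' with acute takes two bytes in UTF-8.
        System.out.println("bytes in UTF-8: " + toUtf8(text).length); // 5
    }
}
```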

Terminology

The name of this class is taken from the terms used in RFC 2278. In that document a charset is defined as the combination of a coded character set and a character-encoding scheme.

A coded character set is a mapping between a set of abstract characters and a set of integers. US-ASCII, ISO 8859-1, JIS X 0201, and full Unicode, which is the same as ISO 10646-1, are examples of coded character sets.

A character-encoding scheme is a mapping between a coded character set and a set of octet (eight-bit byte) sequences. UTF-8, UCS-2, UTF-16, ISO 2022, and EUC are examples of character-encoding schemes. Encoding schemes are often associated with a particular coded character set; UTF-8, for example, is used only to encode Unicode. Some schemes, however, are associated with multiple character sets; EUC, for example, can be used to encode characters in a variety of Asian character sets.

When a coded character set is used exclusively with a single character-encoding scheme then the corresponding charset is usually named for the coded character set; otherwise a charset is usually named for the encoding scheme and, possibly, the locale of the coded character sets that it supports. Hence US-ASCII is both the name of a coded character set and of the charset that encodes it, while EUC-JP is the name of the charset that encodes the JIS X 0201, JIS X 0208, and JIS X 0212 coded character sets.

The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units and sequences of bytes.
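One consequence is that a character outside the Basic Multilingual Plane occupies two char values (a surrogate pair) in a String, and a charset maps those two code units to bytes together. The sketch below illustrates this; the class name CodeUnitDemo and the helper encodedLength are hypothetical.

```java
import java.nio.charset.Charset;

public class CodeUnitDemo {

    // Number of bytes produced when the string is encoded in the named charset.
    static int encodedLength(String s, String charsetName) {
        return Charset.forName(charsetName).encode(s).remaining();
    }

    public static void main(String[] args) {
        // U+1F600, written as a surrogate pair of two UTF-16 code units.
        String s = "\uD83D\uDE00";

        System.out.println("UTF-16 code units: " + s.length());                      // 2
        System.out.println("code points: " + s.codePointCount(0, s.length()));       // 1
        System.out.println("bytes in UTF-8: " + encodedLength(s, "UTF-8"));          // 4
        System.out.println("bytes in UTF-16BE: " + encodedLength(s, "UTF-16BE"));    // 4
    }
}
```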

Since:
1.4
See Also:
CharsetDecoder, CharsetEncoder, CharsetProvider, Character

Constructor Summary
protected Charset(String canonicalName, String[] aliases)
          Initializes a new charset with the given canonical name and alias set.
 
Method Summary
 Set<String> aliases()
          Returns a set containing this charset's aliases.
static SortedMap<String,Charset> availableCharsets()
          Constructs a sorted map from canonical charset names to charset objects.
 boolean canEncode()
          Tells whether or not this charset supports encoding.
 int compareTo(Charset that)
          Compares this charset to another.
abstract  boolean contains(Charset cs)
          Tells whether or not this charset contains the given charset.
 CharBuffer decode(ByteBuffer bb)
          Convenience method that decodes bytes in this charset into Unicode characters.
static Charset defaultCharset()
          Returns the default charset of this Java virtual machine.
 String displayName()
          Returns this charset's human-readable name for the default locale.
 String displayName(Locale locale)
          Returns this charset's human-readable name for the given locale.
 ByteBuffer encode(CharBuffer cb)
          Convenience method that encodes Unicode characters into bytes in this charset.
 ByteBuffer encode(String str)
          Convenience method that encodes a string into bytes in this charset.
 boolean equals(Object ob)
          Tells whether or not this object is equal to another.
static Charset forName(String charsetName)
          Returns a charset object for the named charset.
 int hashCode()
          Computes a hashcode for this charset.
 boolean isRegistered()
          Tells whether or not this charset is registered in the IANA Charset Registry.
static boolean isSupported(String charsetName)
          Tells whether the named charset is supported.
 String name()
          Returns this charset's canonical name.
abstract  CharsetDecoder newDecoder()
          Constructs a new decoder for this charset.
abstract  CharsetEncoder newEncoder()
          Constructs a new encoder for this charset.
 String toString()
          Returns a string describing this charset.
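
Several of the methods summarized above can be combined in a short round trip. This is an illustrative sketch rather than part of the original documentation; the class name RoundTripDemo and the helper roundTrip are hypothetical. The convenience methods encode and decode substitute a replacement for input they cannot map, so a lossy charset does not raise an exception here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class RoundTripDemo {

    // Encodes text with the charset, then decodes the bytes back to a string.
    static String roundTrip(String text, Charset cs) {
        ByteBuffer bytes = cs.encode(text);
        CharBuffer chars = cs.decode(bytes);
        return chars.toString();
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset ascii = Charset.forName("US-ASCII");
        String text = "h\u00e9llo";

        // UTF-8 can represent every character, so the text survives unchanged.
        System.out.println(roundTrip(text, utf8).equals(text)); // true
        // US-ASCII cannot represent U+00E9; the encoder substitutes '?'.
        System.out.println(roundTrip(text, ascii));             // h?llo

        // contains reports whether one charset can represent everything in another.
        System.out.println(utf8.contains(ascii));               // true
        System.out.println(ascii.contains(utf8));               // false
    }
}
```

Code that needs to detect unmappable input, rather than silently replace it, would use newEncoder or newDecoder and configure the error actions there.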
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 
Constructor Detail

Charset

protected Charset(String canonicalName,
                  String[] aliases)
Initializes a new charset with the given canonical name and alias set.
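
Because the constructor is protected, it is reached only from a subclass. The following is a minimal sketch of such a subclass, under stated assumptions: the names CustomCharsetDemo, ExampleCharset, "X-EXAMPLE", and "x-example-alias" are hypothetical, and the class simply delegates encoding and decoding to US-ASCII rather than defining a real encoding. The canonical name begins with "X-" because the charset is not listed in the IANA registry.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public class CustomCharsetDemo {

    // Hypothetical charset that behaves like US-ASCII under a private name.
    static final class ExampleCharset extends Charset {
        private final Charset delegate = Charset.forName("US-ASCII");

        ExampleCharset() {
            // Canonical name plus one alias; illegal names are rejected
            // with IllegalCharsetNameException.
            super("X-EXAMPLE", new String[] {"x-example-alias"});
        }

        @Override
        public boolean contains(Charset cs) {
            // Every charset contains itself; otherwise defer to the delegate.
            return cs.equals(this) || delegate.contains(cs);
        }

        @Override
        public CharsetDecoder newDecoder() {
            // Simplification: the returned decoder reports US-ASCII as its charset.
            return delegate.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            // Simplification: the returned encoder reports US-ASCII as its charset.
            return delegate.newEncoder();
        }
    }

    // Creates an instance of the example charset.
    static Charset create() {
        return new ExampleCharset();
    }

    public static void main(String[] args) {
        Charset cs = create();
        System.out.println(cs.name());                     // X-EXAMPLE
        System.out.println(cs.aliases());                  // [x-example-alias]
        System.out.println(cs.encode("hi").remaining());   // 2
    }
}
```

An instance created this way is usable directly, but Charset.forName would not find it by name unless it were also made available through a CharsetProvider.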