Character sets.

#include <vtkDICOMCharacterSet.h>

Public Types
enum	EnumType { ISO_IR_6 = 0 , ISO_IR_13 = 1 , ISO_IR_100 = 8 , ISO_IR_101 = 9 , ISO_IR_109 = 10 , ISO_IR_110 = 11 , ISO_IR_144 = 12 , ISO_IR_127 = 13 , ISO_IR_126 = 14 , ISO_IR_138 = 15 , ISO_IR_148 = 16 , X_LATIN6 = 17 , ISO_IR_166 = 18 , X_LATIN7 = 19 , X_LATIN8 = 20 , ISO_IR_203 = 21 , X_LATIN9 = 21 , X_LATIN10 = 22 , X_EUCKR = 24 , X_GB2312 = 25 , ISO_2022_IR_6 = 32 , ISO_2022_IR_13 = 33 , ISO_2022_IR_87 = 34 , ISO_2022_IR_13_87 = 35 , ISO_2022_IR_159 = 36 , ISO_2022_IR_87_159 = 38 , ISO_2022_IR_13_87_159 = 39 , ISO_2022_IR_100 = 40 , ISO_2022_IR_101 = 41 , ISO_2022_IR_109 = 42 , ISO_2022_IR_110 = 43 , ISO_2022_IR_144 = 44 , ISO_2022_IR_127 = 45 , ISO_2022_IR_126 = 46 , ISO_2022_IR_138 = 47 , ISO_2022_IR_148 = 48 , ISO_2022_IR_166 = 50 , ISO_2022_IR_203 = 53 , ISO_2022_IR_149 = 56 , ISO_2022_IR_58 = 57 , X_ISO_2022_JP = 58 , X_ISO_2022_JP_1 = 59 , X_ISO_2022_JP_2 = 60 , X_ISO_2022_JP_EXT = 61 , ISO_IR_192 = 64 , GB18030 = 65 , GBK = 66 , X_BIG5 = 67 , X_EUCJP = 69 , X_SJIS = 70 , X_CP874 = 76 , X_CP1250 = 80 , X_CP1251 = 81 , X_CP1252 = 82 , X_CP1253 = 83 , X_CP1254 = 84 , X_CP1255 = 85 , X_CP1256 = 86 , X_CP1257 = 87 , X_CP1258 = 88 , X_KOI8 = 90 , Unknown = 255 }

Public Member Functions
	vtkDICOMCharacterSet ()
	Construct an object that describes the default (ASCII) character set.

	vtkDICOMCharacterSet (int k)
	Construct a character set object from a given code.

	vtkDICOMCharacterSet (const std::string &name)
	Construct a character set object from a SpecificCharacterSet value.

	vtkDICOMCharacterSet (const char *name, size_t nl)

std::string	GetCharacterSetString () const
	Generate SpecificCharacterSet code values (diagnostic only).

const char *	GetDefinedTerm () const
	Get the defined term (possibly multi-valued) for this character set.

const char *	GetMIMEName () const
	Get the internet MIME name for this character set.

const char *	GetName () const
	Get a name that identifies this character set.

unsigned char	GetKey () const
	Get the numerical code for this character set object.

std::string	FromUTF8 (const char *text, size_t l, size_t *lp=nullptr) const
	Convert text from UTF-8 to this encoding.

std::string	FromUTF8 (const std::string &text) const

std::string	ToUTF8 (const char *text, size_t l, size_t *lp=nullptr) const
	Convert text from this encoding to UTF-8.

std::string	ToUTF8 (const std::string &text) const

std::string	ConvertToUTF8 (const char *text, size_t l) const
	Obsolete method for converting to UTF8.

std::string	ToSafeUTF8 (const char *text, size_t l) const
	Convert text to UTF-8 that is safe to print to the console.

std::string	ToSafeUTF8 (const std::string &text) const

std::string	CaseFoldedUTF8 (const char *text, size_t l) const
	Convert text into a form suitable for case-insensitive matching.

std::string	CaseFoldedUTF8 (const std::string &text) const

bool	IsISO2022 () const
	Returns true if ISO 2022 escape codes are used.

bool	IsISO8859 () const
	Returns true if this uses an ISO 8859 code page.

bool	IsBiDirectional () const
	Check for bidirectional character sets.

unsigned int	CountBackslashes (const char *text, size_t l) const
	Count the number of backslashes in an encoded string.

size_t	NextBackslash (const char *text, const char *end) const
	Get the offset to the next backslash, or to the end of the string.

bool	operator== (vtkDICOMCharacterSet b) const

bool	operator!= (vtkDICOMCharacterSet b) const

bool	operator<= (vtkDICOMCharacterSet a) const

bool	operator>= (vtkDICOMCharacterSet a) const

bool	operator< (vtkDICOMCharacterSet a) const

bool	operator> (vtkDICOMCharacterSet a) const

Static Public Member Functions
static void	SetGlobalDefault (vtkDICOMCharacterSet cs)
	Set the character set to use if SpecificCharacterSet is missing.

static vtkDICOMCharacterSet	GetGlobalDefault ()

static void	SetGlobalOverride (bool b)
	Override the value stored in SpecificCharacterSet with the default.

static void	GlobalOverrideOn ()

static void	GlobalOverrideOff ()

static bool	GetGlobalOverride ()

Detailed Description

Character sets.

This class provides the means to convert the various international text encodings used by DICOM to UTF-8 and back again.

During conversion to UTF-8, any codes from the original encoding that can't be converted are replaced by Unicode's "REPLACEMENT CHARACTER", which is a question mark in a black diamond. For instance, if the original encoding is ISO_IR_6 (ASCII), any octets outside of the valid ASCII range of 0 to 127 will become "REPLACEMENT CHARACTER".
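
The replacement rule for ISO_IR_6 can be pictured with a small standalone sketch. The helper name `AsciiToUTF8Sketch` is hypothetical and is not part of vtkDICOMCharacterSet; it only illustrates the behavior described above, assuming every out-of-range octet is replaced individually.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not the library's implementation: decode ISO_IR_6
// (ASCII) text, replacing each octet above 127 with U+FFFD.
std::string AsciiToUTF8Sketch(const char *text, size_t l)
{
  std::string out;
  for (size_t i = 0; i < l; i++) {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x80) {
      out.push_back(static_cast<char>(c)); // valid ASCII passes through
    } else {
      out += "\xEF\xBF\xBD"; // UTF-8 encoding of U+FFFD
    }
  }
  return out;
}
```

For the three-byte input `a`, `0xE9`, `z`, this sketch yields `a`, the three bytes of U+FFFD, then `z`.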

DICOM supports a fairly small number of single-byte and multi-byte character sets. The only VRs that support these character sets are PN, LO, SH, ST, LT, and UT (all other text VRs must be ASCII). In addition to ASCII, there are twelve 8-bit single-byte encodings, three ISO 2022 multi-byte encodings, and three variable-length encodings (UTF-8, GB18030, GBK).

In some DICOM data sets, especially old ones, the SpecificCharacterSet attribute will be missing and it might be necessary to manually specify a character set for the application to use. Use SetGlobalDefault() to do so. The vtkDICOMCharacterSet constructor can take the desired character encoding as a string, where the following encodings are allowed: 'ascii', 'latin1', 'latin2', 'latin3', 'latin4', 'latin5', 'latin7', 'latin9', 'cyrillic' (iso-8859-5), 'arabic' (iso-8859-6), 'greek' (iso-8859-7), 'hebrew' (iso-8859-8), 'tis-620', 'shift-jis', 'euc-jp', 'iso-2022-jp', 'korean' (euc-kr), 'chinese' (gb2312), 'gbk', 'gb18030', 'big5', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', and 'utf-8'. Common aliases of these character sets can also be used.

Constructor & Destructor Documentation

◆ vtkDICOMCharacterSet() [1/2]

vtkDICOMCharacterSet::vtkDICOMCharacterSet ( int k )

inline

Construct a character set object from a given code.

The code can be any of the enumerated code values. The ISO_2022 flag can be added to any of the ISO-8859 codes to indicate that the character set allows the use of escape codes. Also note that ISO_2022_IR_87 and ISO_2022_IR_159 are combining codes that can be added to each other and to ISO_IR_13. Specifying any other codes in combination can lead to undefined results, for example "ISO_2022_IR_100 | ISO_2022_IR_101" is not permitted and "ISO_2022_IR_100" must be used instead.
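
The arithmetic behind the combinable codes can be checked against the values listed under EnumType. The sketch below copies a few of those values into a local enum so it compiles without the library. The constant `ISO_2022 = 32` is an assumption inferred from `ISO_2022_IR_6 = 32`, and `WithISO2022` is a hypothetical helper.

```cpp
#include <cassert>

// Values copied from the EnumType list on this page. ISO_2022 = 32 is an
// assumption inferred from ISO_2022_IR_6 = 32; consult the header for the
// actual flag constant.
enum SketchCodes {
  ISO_IR_13 = 1, ISO_IR_100 = 8, ISO_IR_144 = 12,
  ISO_2022 = 32,
  ISO_2022_IR_13 = 33, ISO_2022_IR_87 = 34, ISO_2022_IR_13_87 = 35,
  ISO_2022_IR_159 = 36, ISO_2022_IR_87_159 = 38,
  ISO_2022_IR_13_87_159 = 39,
  ISO_2022_IR_100 = 40, ISO_2022_IR_144 = 44
};

// Hypothetical helper: add the escape-code flag to a single-byte code.
constexpr int WithISO2022(int code) { return code | ISO_2022; }

// Adding the flag to an ISO-8859 code gives its ISO 2022 variant.
static_assert(WithISO2022(ISO_IR_100) == ISO_2022_IR_100, "latin1");
static_assert(WithISO2022(ISO_IR_144) == ISO_2022_IR_144, "cyrillic");

// The Japanese codes combine with each other and with ISO_IR_13.
static_assert((ISO_2022_IR_13 | ISO_2022_IR_87) == ISO_2022_IR_13_87, "");
static_assert((ISO_2022_IR_87 | ISO_2022_IR_159) == ISO_2022_IR_87_159, "");
static_assert((ISO_2022_IR_13 | ISO_2022_IR_87 | ISO_2022_IR_159)
              == ISO_2022_IR_13_87_159, "");
```

The combinations work because the listed values differ only in single bits (1, 2, 4, and 32), which is why other combinations, whose values are not bit flags, give undefined results.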

◆ vtkDICOMCharacterSet() [2/2]

vtkDICOMCharacterSet::vtkDICOMCharacterSet ( const std::string & name )

inline explicit

Construct a character set object from a SpecificCharacterSet value.

This generates an 8-bit code that uniquely identifies a DICOM character set plus its code extensions.

Member Function Documentation

◆ CaseFoldedUTF8()

std::string vtkDICOMCharacterSet::CaseFoldedUTF8	(	const char *	text,
		size_t	l
	)		const

Convert text into a form suitable for case-insensitive matching.

This function will perform case normalization on a string by converting it to lowercase, and by normalizing the forms of lowercase characters that do not have an exact uppercase equivalent. In some cases, it might increase the length of the string. It covers modern European scripts (including Greek and Cyrillic) and Latin characters used in East Asian languages.
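
As a rough illustration of why folding is more than lowercasing, the sketch below folds ASCII letters and maps the UTF-8 encoding of U+00DF (sharp s) to "ss", as Unicode full case folding does. `FoldSketch` is a hypothetical helper that handles only these two cases; whether the library folds any particular character the same way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, deliberately tiny case folder for UTF-8 input.
std::string FoldSketch(const std::string &text)
{
  std::string out;
  for (size_t i = 0; i < text.size(); i++) {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c >= 'A' && c <= 'Z') {
      out.push_back(static_cast<char>(c - 'A' + 'a')); // ASCII lowercase
    } else if (c == 0xC3 && i + 1 < text.size() &&
               static_cast<unsigned char>(text[i + 1]) == 0x9F) {
      out += "ss"; // U+00DF folds to "ss" under full case folding
      i++;         // skip the second byte of the sequence
    } else {
      out.push_back(text[i]); // everything else is left untouched
    }
  }
  return out;
}
```

With this folding, "STRASSE" and the spelling that uses the sharp s compare equal after both are folded.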

◆ CountBackslashes()

unsigned int vtkDICOMCharacterSet::CountBackslashes	(	const char *	text,
		size_t	l
	)		const

Count the number of backslashes in an encoded string.

The backslash byte is sometimes present as half of a multibyte character in the Japanese and Chinese encodings. This method skips these false backslashes and counts only real backslashes.
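
A false backslash is easy to reproduce in Shift_JIS, where some two-byte characters have 0x5C as their second byte. The sketch below is a simplified, Shift_JIS-only counter written for illustration. `CountRealBackslashesSJIS` is a hypothetical name and the lead-byte ranges are the usual Shift_JIS ones, not taken from the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: count backslashes in Shift_JIS text, skipping the
// second byte of each two-byte character.
unsigned int CountRealBackslashesSJIS(const char *text, size_t l)
{
  unsigned int count = 0;
  for (size_t i = 0; i < l; i++) {
    unsigned char c = static_cast<unsigned char>(text[i]);
    bool lead = (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    if (lead && i + 1 < l) {
      i++; // skip the trail byte, which may be 0x5C
    } else if (c == 0x5C) {
      count++; // a real backslash (value separator)
    }
  }
  return count;
}
```

For the bytes 0x83 0x5C 0x5C 0x41, a naive byte search finds two backslashes, while this sketch reports one because the first 0x5C is the trail byte of a two-byte character.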

◆ FromUTF8()

std::string vtkDICOMCharacterSet::FromUTF8	(	const char *	text,
		size_t	l,
		size_t *	lp = nullptr
	)		const

Convert text from UTF-8 to this encoding.

Attempt to convert from UTF-8 to this character set. Every non-convertible character will be replaced with '?'. If you pass a non-null value for the "lp" parameter, then "lp" will be set to the position in the input UTF-8 string where the first conversion error occurred, and the unconverted character will be output as <U+XXXX> instead of '?'. If the conversion was error-free, then "lp" will be set to the length of the input string.
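
The "lp" contract can be sketched with a standalone converter whose target is plain ASCII. `UTF8ToAsciiSketch` is a hypothetical function, not the library's code. It assumes that every unconvertible character (not only the first) is written as <U+XXXX> when "lp" is non-null, and it skips overlong and surrogate checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical sketch of the lp contract, with ASCII as the target encoding.
std::string UTF8ToAsciiSketch(const char *text, size_t l,
                              size_t *lp = nullptr)
{
  std::string out;
  size_t firstError = l; // stays at l if nothing fails
  size_t i = 0;
  while (i < l) {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x80) { // ASCII converts directly
      out.push_back(text[i]);
      i++;
      continue;
    }
    // Decode one UTF-8 sequence to find the code point that failed.
    size_t n = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : (c >= 0xC0) ? 2 : 1;
    unsigned long cp = (n == 1) ? 0xFFFDul : (c & (0xFFu >> (n + 1)));
    size_t j = 1;
    while (j < n && i + j < l &&
           (static_cast<unsigned char>(text[i + j]) & 0xC0) == 0x80) {
      cp = (cp << 6) | (static_cast<unsigned char>(text[i + j]) & 0x3F);
      j++;
    }
    if (j < n) { cp = 0xFFFD; } // truncated or malformed sequence
    if (firstError == l) { firstError = i; }
    if (lp) {
      char buf[16];
      std::snprintf(buf, sizeof(buf), "<U+%04lX>", cp);
      out += buf;
    } else {
      out.push_back('?');
    }
    i += j;
  }
  if (lp) { *lp = firstError; }
  return out;
}
```

Callers can compare the reported position with the input length to detect a lossy conversion without inspecting the output.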

◆ GetCharacterSetString()

std::string vtkDICOMCharacterSet::GetCharacterSetString ( ) const

Generate SpecificCharacterSet code values (diagnostic only).

This will return the same value as GetDefinedTerm() if a defined term exists, otherwise it returns the same value as GetName() if the character set has a name, with a final fallback to the number returned by GetKey() converted to a string.
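
The fallback order can be written out as a short sketch. `DescribeCharacterSetSketch` is a hypothetical free function whose three arguments stand in for the results of GetDefinedTerm(), GetName(), and GetKey(). Representing "no name" with a null pointer is an assumption made for the sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the fallback chain described above.
std::string DescribeCharacterSetSketch(const char *definedTerm,
                                       const char *name,
                                       unsigned char key)
{
  if (definedTerm) { return definedTerm; }   // first choice: defined term
  if (name) { return name; }                 // second choice: name
  return std::to_string(static_cast<int>(key)); // last resort: numeric key
}
```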

◆ GetDefinedTerm()

const char* vtkDICOMCharacterSet::GetDefinedTerm ( ) const

Get the defined term (possibly multi-valued) for this character set.

If the character set is permitted by the DICOM standard, this will return the defined term, otherwise the returned value will be NULL. An empty string is returned for the default character set (ISO_IR 6). Multiple values will be separated by backslashes, e.g. "\\ISO 2022 IR 58" or "ISO 2022 IR 13\\ISO 2022 IR 87".
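
A caller that needs the individual values can split the returned term on backslashes. The sketch below is a plain string splitter. `SplitDefinedTermSketch` is a hypothetical helper, and it relies on the fact that defined terms are ASCII, so a byte-wise search is safe here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a multi-valued defined term into its values.
std::vector<std::string> SplitDefinedTermSketch(const std::string &term)
{
  std::vector<std::string> values;
  size_t start = 0;
  for (;;) {
    size_t pos = term.find('\\', start);
    if (pos == std::string::npos) {
      values.push_back(term.substr(start)); // last (or only) value
      break;
    }
    values.push_back(term.substr(start, pos - start));
    start = pos + 1;
  }
  return values;
}
```

A term that begins with a backslash, such as the first example above, splits into an empty first value (the default repertoire) followed by the named character set.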

◆ GetMIMEName()

const char* vtkDICOMCharacterSet::GetMIMEName ( ) const

Get the internet MIME name for this character set.

The return value will be NULL if there isn't a good match between this character set and one of the MIME character sets in common use on the internet. In that case, conversion may be necessary, either to UTF-8 or to a different encoding with a similar character repertoire. For example, "ISO 2022 IR 149" can be converted to "EUC-KR", "ISO 2022 IR 58" can be converted to "GBK", and "ISO 2022 IR 13\\ISO 2022 IR 87" can be converted to "Shift_JIS". Note that "ISO-2022-JP" is not equivalent to DICOM's Japanese encodings since it does not allow half-width katakana or the "ISO 2022 IR 159" characters.
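
An application that must emit a MIME charset could keep its own table of conversion targets for the cases mentioned above. The sketch below encodes only those three examples. `SuggestConversionTargetSketch` is a hypothetical helper, and the keys are written exactly as they appear in the text above.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical lookup of a conversion target for character sets that have
// no direct MIME name. Only the three examples from the text are listed.
const char *SuggestConversionTargetSketch(const char *term)
{
  struct Entry { const char *term; const char *target; };
  static const Entry table[] = {
    { "ISO 2022 IR 149", "EUC-KR" },
    { "ISO 2022 IR 58", "GBK" },
    { "ISO 2022 IR 13\\ISO 2022 IR 87", "Shift_JIS" },
  };
  for (const Entry &e : table) {
    if (std::strcmp(term, e.term) == 0) { return e.target; }
  }
  return nullptr; // no suggestion: fall back to UTF-8
}
```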

◆ GetName()

const char* vtkDICOMCharacterSet::GetName ( ) const

Get a name that identifies this character set.

For DICOM character sets, the name is based on the defined term, and for other character sets, the common name is used. If no name exists, then "Unknown" will be returned.

◆ IsBiDirectional()

bool vtkDICOMCharacterSet::IsBiDirectional ( ) const

inline

Check for bidirectional character sets.

This is used to check for character sets that are likely to contain characters that print right-to-left, specifically Hebrew and Arabic. Note that even though some parts of Unicode fall into this category, this flag is off for Unicode and GB18030/GBK.

◆ IsISO2022()

bool vtkDICOMCharacterSet::IsISO2022 ( ) const

inline

Returns true if ISO 2022 escape codes are used.

If this method returns true, then escape codes can be used to switch between character sets.

◆ NextBackslash()

size_t vtkDICOMCharacterSet::NextBackslash	(	const char *	text,
		const char *	end
	)		const

Get the offset to the next backslash, or to the end of the string.

In order to work properly, this method requires that its input is either at the beginning of the string or just after a backslash.

◆ SetGlobalDefault()

static void vtkDICOMCharacterSet::SetGlobalDefault ( vtkDICOMCharacterSet cs )

inline static

Set the character set to use if SpecificCharacterSet is missing.

Some DICOM files do not list a SpecificCharacterSet attribute, but nevertheless use a non-ASCII character encoding. This method can be used to specify the character set in the absence of SpecificCharacterSet. If SpecificCharacterSet is present, the default will not override it unless OverrideCharacterSet is true.

◆ SetGlobalOverride()

static void vtkDICOMCharacterSet::SetGlobalOverride ( bool b )

inline static

Override the value stored in SpecificCharacterSet with the default.

This method can be used if the SpecificCharacterSet attribute of a file is incorrect. It forces the use of the character set that was set with SetGlobalDefault.
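
The interaction between the global default and the global override reduces to a small decision rule, sketched below. `EffectiveCharacterSetSketch` is a hypothetical function and plain integers stand in for character set keys. A null pointer represents a file with no SpecificCharacterSet attribute.

```cpp
#include <cassert>

// Hypothetical sketch of which character set ends up being used.
// fromFile is null when SpecificCharacterSet is missing from the file.
int EffectiveCharacterSetSketch(const int *fromFile, int globalDefault,
                                bool globalOverride)
{
  if (globalOverride || fromFile == nullptr) {
    return globalDefault; // override forced, or nothing in the file
  }
  return *fromFile; // otherwise trust the file
}
```

With a default of 8 (ISO_IR_100) and a file that declares 64 (ISO_IR_192), the sketch returns 64 unless the override is on, in which case it returns 8.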

◆ ToSafeUTF8()

std::string vtkDICOMCharacterSet::ToSafeUTF8	(	const char *	text,
		size_t	l
	)		const

Convert text to UTF-8 that is safe to print to the console.

All control characters or unconvertible characters will be replaced by four-byte octal codes, e.g. '\033'. Backslashes will be replaced by '\134' to avoid any potential ambiguity.
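
The escaping rule can be sketched for text that is already UTF-8. `EscapeForConsoleSketch` is a hypothetical helper that handles only ASCII control characters and backslashes. It passes all other bytes through unchanged, so it does not cover the unconvertible-character case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical sketch: replace ASCII control characters and backslashes
// with a backslash followed by three octal digits.
std::string EscapeForConsoleSketch(const char *text, size_t l)
{
  std::string out;
  for (size_t i = 0; i < l; i++) {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x20 || c == 0x7F || c == '\\') {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "\\%03o",
                    static_cast<unsigned int>(c));
      out += buf; // e.g. ESC (27) becomes the four characters \033
    } else {
      out.push_back(text[i]);
    }
  }
  return out;
}
```

Escaping the backslash itself (octal 134) keeps the output unambiguous, since a literal backslash in the result always introduces an octal code.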

◆ ToUTF8()

std::string vtkDICOMCharacterSet::ToUTF8	(	const char *	text,
		size_t	l,
		size_t *	lp = nullptr
	)		const

Convert text from this encoding to UTF-8.

This will convert text to UTF-8, which is generally a lossless process for western languages but not for the CJK languages. Characters that cannot be mapped to Unicode, or whose place in Unicode is not known, will be printed as Unicode U+FFFD, which appears as a question mark in a diamond. If you pass a non-null value for the "lp" parameter, then "lp" will be set to the position in the input string where the first conversion error occurred, and each unconverted byte will be output as <XX> (a hexadecimal code in angle brackets). If the conversion was error-free, then "lp" will be set to the length of the input string.

The documentation for this class was generated from the following file:

vtkDICOMCharacterSet.h
