vtk-dicom
0.8.17
Character sets.
#include <vtkDICOMCharacterSet.h>
Public Member Functions
vtkDICOMCharacterSet ()
Construct an object that describes the default (ASCII) character set.
vtkDICOMCharacterSet (int k)
Construct a character set object from a given code.
vtkDICOMCharacterSet (const std::string &name)
Construct a character set object from a SpecificCharacterSet value.
vtkDICOMCharacterSet (const char *name, size_t nl)
std::string | GetCharacterSetString () const
Generate SpecificCharacterSet code values (diagnostic only).
const char * | GetDefinedTerm () const
Get the defined term (possibly multi-valued) for this character set.
const char * | GetMIMEName () const
Get the internet MIME name for this character set.
const char * | GetName () const
Get a name that identifies this character set.
unsigned char | GetKey () const
Get the numerical code for this character set object.
std::string | FromUTF8 (const char *text, size_t l, size_t *lp=nullptr) const
Convert text from UTF-8 to this encoding.
std::string | FromUTF8 (const std::string &text) const
std::string | ToUTF8 (const char *text, size_t l, size_t *lp=nullptr) const
Convert text from this encoding to UTF-8.
std::string | ToUTF8 (const std::string &text) const
std::string | ConvertToUTF8 (const char *text, size_t l) const
Obsolete method for converting to UTF-8.
std::string | ToSafeUTF8 (const char *text, size_t l) const
Convert text to UTF-8 that is safe to print to the console.
std::string | ToSafeUTF8 (const std::string &text) const
std::string | CaseFoldedUTF8 (const char *text, size_t l) const
Convert text into a form suitable for case-insensitive matching.
std::string | CaseFoldedUTF8 (const std::string &text) const
bool | IsISO2022 () const
Returns true if ISO 2022 escape codes are used.
bool | IsISO8859 () const
Returns true if this uses an ISO 8859 code page.
bool | IsBiDirectional () const
Check for bidirectional character sets.
unsigned int | CountBackslashes (const char *text, size_t l) const
Count the number of backslashes in an encoded string.
size_t | NextBackslash (const char *text, const char *end) const
Get the offset to the next backslash, or to the end of the string.
bool | operator== (vtkDICOMCharacterSet b) const
bool | operator!= (vtkDICOMCharacterSet b) const
bool | operator<= (vtkDICOMCharacterSet a) const
bool | operator>= (vtkDICOMCharacterSet a) const
bool | operator< (vtkDICOMCharacterSet a) const
bool | operator> (vtkDICOMCharacterSet a) const
Static Public Member Functions
static void | SetGlobalDefault (vtkDICOMCharacterSet cs)
Set the character set to use if SpecificCharacterSet is missing.
static vtkDICOMCharacterSet | GetGlobalDefault ()
static void | SetGlobalOverride (bool b)
Override the value stored in SpecificCharacterSet with the default.
static void | GlobalOverrideOn ()
static void | GlobalOverrideOff ()
static bool | GetGlobalOverride ()
Character sets.
This class provides the means to convert the various international text encodings used by DICOM to UTF-8 and back again.
During conversion to UTF-8, any codes from the original encoding that can't be converted are replaced by Unicode's "REPLACEMENT CHARACTER", which is a question mark in a black diamond. For instance, if the original encoding is ISO_IR_6 (ASCII), any octets outside of the valid ASCII range of 0 to 127 will become "REPLACEMENT CHARACTER".
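The replacement behaviour for ISO_IR 6 can be illustrated with a minimal sketch. The helper `AsciiToUTF8Sketch` below is hypothetical and is not part of vtk-dicom; it only mimics the rule described above, assuming plain ASCII input with no escape sequences.

```cpp
#include <string>

// Hypothetical helper (not the vtk-dicom implementation): decode text that
// is declared as ISO_IR 6 (ASCII) and emit UTF-8.
std::string AsciiToUTF8Sketch(const std::string& text)
{
  std::string out;
  for (unsigned char c : text)
  {
    if (c < 0x80)
    {
      // Valid ASCII octets are identical in UTF-8.
      out.push_back(static_cast<char>(c));
    }
    else
    {
      // Octets outside 0..127 become U+FFFD REPLACEMENT CHARACTER,
      // which is the three-byte sequence EF BF BD in UTF-8.
      out += "\xEF\xBF\xBD";
    }
  }
  return out;
}
```

With this sketch, the Latin-1 octet 0xE9 in text that claims to be ASCII yields one replacement character rather than "é".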
DICOM supports a fairly small number of single-byte and multi-byte character sets. The only VRs that support these character sets are PN, LO, SH, ST, LT, and UT (all other text VRs must be ASCII). In addition to ASCII, there are twelve 8-bit single-byte encodings, three ISO 2022 multi-byte encodings, and three variable-length encodings (UTF-8, GB18030, GBK).
In some DICOM data sets, especially old ones, the SpecificCharacterSet attribute will be missing and it might be necessary to manually specify a character set for the application to use. Use SetGlobalDefault() to do so. The vtkDICOMCharacterSet constructor can take the desired character encoding as a string, where the following encodings are allowed: 'ascii', 'latin1', 'latin2', 'latin3', 'latin4', 'latin5', 'latin7', 'latin9', 'cyrillic' (iso-8859-5), 'arabic' (iso-8859-6), 'greek' (iso-8859-7), 'hebrew' (iso-8859-8), 'tis-620', 'shift-jis', 'euc-jp', 'iso-2022-jp', 'korean' (euc-kr), 'chinese' (gb2312), 'gbk', 'gb18030', 'big5', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', 'cp1258', and 'utf-8'. Common aliases of these character sets can also be used.
vtkDICOMCharacterSet::vtkDICOMCharacterSet (int k) [inline]
Construct a character set object from a given code.
The code can be any of the enumerated code values. The ISO_2022 flag can be added to any of the ISO-8859 codes to indicate that the character set allows the use of escape codes. Also note that ISO_2022_IR_87 and ISO_2022_IR_159 are combining codes that can be added to each other and to ISO_IR_13. Specifying any other codes in combination can lead to undefined results, for example "ISO_2022_IR_100 | ISO_2022_IR_101" is not permitted and "ISO_2022_IR_100" must be used instead.
vtkDICOMCharacterSet::vtkDICOMCharacterSet (const std::string &name) [inline, explicit]
Construct a character set object from a SpecificCharacterSet value.
This generates an 8-bit code that uniquely identifies a DICOM character set plus its code extensions.
std::string vtkDICOMCharacterSet::CaseFoldedUTF8 (const char *text, size_t l) const
Convert text into a form suitable for case-insensitive matching.
This function performs case normalization on a string by converting it to lowercase, and by normalizing the forms of lowercase characters that do not have an exact uppercase equivalent. In some cases, this might increase the length of the string. It covers modern European scripts (including Greek and Cyrillic) and Latin characters used in East Asian languages.
unsigned int vtkDICOMCharacterSet::CountBackslashes (const char *text, size_t l) const
Count the number of backslashes in an encoded string.
The backslash byte is sometimes present as half of a multibyte character in the Japanese and Chinese encodings. This method skips these false backslashes and counts only real backslashes.
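To show why a naive byte scan is not enough, here is a minimal sketch for one specific encoding. It assumes Shift_JIS, where the lead bytes of double-byte characters are 0x81 to 0x9F and 0xE0 to 0xFC and the trail byte may be 0x5C. The function `CountBackslashesShiftJIS` is hypothetical and is not the vtk-dicom implementation, which handles every supported encoding.

```cpp
#include <cstddef>

// Hypothetical helper (not the vtk-dicom implementation): count the real
// backslashes in Shift_JIS text, skipping 0x5C bytes that are the second
// half of a double-byte character.
unsigned int CountBackslashesShiftJIS(const char* text, std::size_t l)
{
  unsigned int count = 0;
  std::size_t i = 0;
  while (i < l)
  {
    unsigned char c = static_cast<unsigned char>(text[i]);
    bool isLeadByte = (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    if (isLeadByte && i + 1 < l)
    {
      // Skip both bytes of the double-byte character, so that a 0x5C
      // trail byte is never mistaken for a delimiter.
      i += 2;
      continue;
    }
    if (c == 0x5C)
    {
      count++;
    }
    i++;
  }
  return count;
}
```

For example, the kanji encoded as the bytes 0x95 0x5C contains a false backslash. A following single 0x5C byte is a real one, so the four-byte input `"\x95\x5C\\A"` contains one backslash under this sketch.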
std::string vtkDICOMCharacterSet::FromUTF8 (const char *text, size_t l, size_t *lp=nullptr) const
Convert text from UTF-8 to this encoding.
Attempt to convert from UTF-8 to this character set. Every non-convertible character will be replaced with '?'. If you pass a non-null value for the "lp" parameter, then "lp" will be set to the position in the input UTF-8 string where the first conversion error occurred, and the unconverted character will be output as <U+XXXX> instead of '?'. If the conversion was error-free, then "lp" will be set to the length of the input string.
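The error-position contract of "lp" can be sketched for the simplest target encoding, ASCII. The function `Utf8ToAsciiSketch` is hypothetical and is not part of vtk-dicom. As a simplification, it always writes '?' for a non-convertible character, whereas the real method writes <U+XXXX> when "lp" is non-null. It also assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the vtk-dicom implementation): convert UTF-8 to
// ASCII, replacing each non-ASCII character with '?'. If lp is non-null,
// it receives the offset of the first conversion error, or the input
// length if the conversion was error-free.
std::string Utf8ToAsciiSketch(const std::string& text, std::size_t* lp = nullptr)
{
  std::string out;
  std::size_t firstError = text.size();
  std::size_t i = 0;
  while (i < text.size())
  {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x80)
    {
      out.push_back(text[i]);
      i++;
      continue;
    }
    if (firstError == text.size())
    {
      firstError = i;  // remember where the first failure occurred
    }
    out.push_back('?');
    // Skip the lead byte and any continuation bytes (10xxxxxx) so that
    // one multi-byte character produces exactly one '?'.
    i++;
    while (i < text.size() &&
           (static_cast<unsigned char>(text[i]) & 0xC0) == 0x80)
    {
      i++;
    }
  }
  if (lp)
  {
    *lp = firstError;
  }
  return out;
}
```

A caller can compare the value stored through "lp" with the input length to decide whether the text was representable in the target character set.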
std::string vtkDICOMCharacterSet::GetCharacterSetString () const
Generate SpecificCharacterSet code values (diagnostic only).
This will return the same value as GetDefinedTerm() if a defined term exists. Otherwise it returns the same value as GetName() if the character set has a name, with a final fallback to the number returned by GetKey() converted to a string.
const char* vtkDICOMCharacterSet::GetDefinedTerm () const
Get the defined term (possibly multi-valued) for this character set.
If the character set is permitted by the DICOM standard, this will return the defined term, otherwise the returned value will be NULL. An empty string is returned for the default character set (ISO_IR 6). Multiple values will be separated by backslashes, e.g. "\\ISO 2022 IR 58" or "ISO 2022 IR 13\\ISO 2022 IR 87".
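A caller that needs the individual values of a multi-valued defined term can split the returned string on backslashes. The helper `SplitDefinedTerm` below is a hypothetical sketch, not part of vtk-dicom. A leading backslash, as in the first example above, produces an empty first value, which stands for the default character set.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of vtk-dicom): split a multi-valued
// defined term on backslashes. Empty values are preserved.
std::vector<std::string> SplitDefinedTerm(const std::string& term)
{
  std::vector<std::string> values;
  std::size_t start = 0;
  for (;;)
  {
    std::size_t pos = term.find('\\', start);
    if (pos == std::string::npos)
    {
      // Last (or only) value: everything after the final backslash.
      values.push_back(term.substr(start));
      break;
    }
    values.push_back(term.substr(start, pos - start));
    start = pos + 1;
  }
  return values;
}
```

Defined terms are plain ASCII, so a simple byte search for the backslash is safe here. That is not true for encoded text values, which is why CountBackslashes() and NextBackslash() exist.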
const char* vtkDICOMCharacterSet::GetMIMEName () const
Get the internet MIME name for this character set.
The return value will be NULL if there isn't a good match between this character set and one of the MIME character sets in common use on the internet. In that case conversion may be necessary, either to UTF-8 or to a different encoding with a similar character repertoire. For example, "ISO 2022 IR 149" can be converted to "EUC-KR", "ISO 2022 IR 58" can be converted to "GBK", and "ISO 2022 IR 13\\ISO 2022 IR 87" can be converted to "Shift_JIS". Note that "ISO-2022-JP" is not equivalent to DICOM's Japanese encodings since it does not allow half-width katakana or the "ISO 2022 IR 159" characters.
const char* vtkDICOMCharacterSet::GetName () const
Get a name that identifies this character set.
For DICOM character sets, the name is based on the defined term, and for other character sets, the common name is used. If no name exists, then "Unknown" will be returned.
bool vtkDICOMCharacterSet::IsBiDirectional () const [inline]
Check for bidirectional character sets.
This is used to check for character sets that are likely to contain characters that print right-to-left, specifically Hebrew and Arabic. Note that even though some parts of Unicode fall into this category, this flag is off for Unicode and GB18030/GBK.
bool vtkDICOMCharacterSet::IsISO2022 () const [inline]
Returns true if ISO 2022 escape codes are used.
If this method returns true, then escape codes can be used to switch between character sets.
size_t vtkDICOMCharacterSet::NextBackslash (const char *text, const char *end) const
Get the offset to the next backslash, or to the end of the string.
In order to work properly, this method requires that its input is either at the beginning of the string or just after a backslash.
static void vtkDICOMCharacterSet::SetGlobalDefault (vtkDICOMCharacterSet cs) [inline, static]
Set the character set to use if SpecificCharacterSet is missing.
Some DICOM files do not list a SpecificCharacterSet attribute, but nevertheless use a non-ASCII character encoding. This method can be used to specify the character set in absence of SpecificCharacterSet. If SpecificCharacterSet is present, the default will not override it unless OverrideCharacterSet is true.
static void vtkDICOMCharacterSet::SetGlobalOverride (bool b) [inline, static]
Override the value stored in SpecificCharacterSet with the default.
This method can be used if the SpecificCharacterSet attribute of a file is incorrect. It forces the use of the character set that was set with SetGlobalDefault.
std::string vtkDICOMCharacterSet::ToSafeUTF8 (const char *text, size_t l) const
Convert text to UTF-8 that is safe to print to the console.
All control characters or unconvertible characters will be replaced by four-byte octal codes, e.g. '\033'. Backslashes will be replaced by '\134' to avoid any potential ambiguity.
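The escaping rule can be sketched as follows. The helper `EscapeForConsoleSketch` is hypothetical and is not the vtk-dicom implementation. It assumes its input is already valid UTF-8, so it covers only ASCII control characters and the backslash, not unconvertible characters.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not the vtk-dicom implementation): replace ASCII
// control characters and backslashes with four-byte octal codes.
std::string EscapeForConsoleSketch(const std::string& utf8)
{
  std::string out;
  for (unsigned char c : utf8)
  {
    if (c < 0x20 || c == 0x7F || c == '\\')
    {
      // A backslash followed by three octal digits, e.g. "\033" for ESC
      // and "\134" for the backslash itself.
      char buf[5];
      std::snprintf(buf, sizeof(buf), "\\%03o", static_cast<unsigned int>(c));
      out += buf;
    }
    else
    {
      // Printable ASCII and the bytes of multi-byte UTF-8 sequences
      // pass through unchanged.
      out.push_back(static_cast<char>(c));
    }
  }
  return out;
}
```

Escaping the backslash itself is what keeps the output unambiguous: a literal "\033" in the printed text can only have come from an ESC character, never from a backslash followed by digits in the original.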
std::string vtkDICOMCharacterSet::ToUTF8 (const char *text, size_t l, size_t *lp=nullptr) const
Convert text from this encoding to UTF-8.
This will convert text to UTF-8, which is generally a lossless process for western languages but not for the CJK languages. Characters that cannot be mapped to Unicode, or whose place in Unicode is not known, will be output as U+FFFD, which appears as a question mark in a diamond. If you pass a non-null value for the "lp" parameter, then "lp" will be set to the position in the input string where the first conversion error occurred, and each unconverted byte will be output as <XX> (a hexadecimal code in angle brackets). If the conversion was error-free, then "lp" will be set to the length of the input string.