vtk-dicom  0.8.14
vtkDICOMCharacterSet Class Reference

Character sets.

#include <vtkDICOMCharacterSet.h>

Public Types

enum  EnumType {
  ISO_IR_6 = 0 , ISO_IR_13 = 1 , ISO_IR_100 = 8 , ISO_IR_101 = 9 ,
  ISO_IR_109 = 10 , ISO_IR_110 = 11 , ISO_IR_144 = 12 , ISO_IR_127 = 13 ,
  ISO_IR_126 = 14 , ISO_IR_138 = 15 , ISO_IR_148 = 16 , X_LATIN6 = 17 ,
  ISO_IR_166 = 18 , X_LATIN7 = 19 , X_LATIN8 = 20 , X_LATIN9 = 21 ,
  X_LATIN10 = 22 , X_EUCKR = 24 , X_GB2312 = 25 , ISO_2022_IR_6 = 32 ,
  ISO_2022_IR_13 = 33 , ISO_2022_IR_87 = 34 , ISO_2022_IR_159 = 36 , ISO_2022_IR_100 = 40 ,
  ISO_2022_IR_101 = 41 , ISO_2022_IR_109 = 42 , ISO_2022_IR_110 = 43 , ISO_2022_IR_144 = 44 ,
  ISO_2022_IR_127 = 45 , ISO_2022_IR_126 = 46 , ISO_2022_IR_138 = 47 , ISO_2022_IR_148 = 48 ,
  ISO_2022_IR_166 = 50 , ISO_2022_IR_149 = 56 , ISO_2022_IR_58 = 57 , ISO_IR_192 = 64 ,
  GB18030 = 65 , GBK = 66 , X_BIG5 = 67 , X_EUCJP = 69 ,
  X_SJIS = 70 , X_CP874 = 76 , X_CP1250 = 80 , X_CP1251 = 81 ,
  X_CP1252 = 82 , X_CP1253 = 83 , X_CP1254 = 84 , X_CP1255 = 85 ,
  X_CP1256 = 86 , X_CP1257 = 87 , X_KOI8 = 90 , Unknown = 255
}
 

Public Member Functions

 vtkDICOMCharacterSet ()
 Construct an object that describes the default (ASCII) character set.
 
 vtkDICOMCharacterSet (int k)
 Construct a character set object from a given code.
 
 vtkDICOMCharacterSet (const std::string &name)
 Construct a character set object from a SpecificCharacterSet value.
 
 vtkDICOMCharacterSet (const char *name, size_t nl)
 
std::string GetCharacterSetString () const
 Generate SpecificCharacterSet code values (diagnostic only).
 
unsigned char GetKey () const
 Get the numerical code for this character set object.
 
std::string FromUTF8 (const char *text, size_t l, size_t *lp=0) const
 Convert text from UTF-8 to this encoding.
 
std::string FromUTF8 (const std::string &text) const
 
std::string ToUTF8 (const char *text, size_t l, size_t *lp=0) const
 Convert text from this encoding to UTF-8.
 
std::string ToUTF8 (const std::string &text) const
 
std::string ConvertToUTF8 (const char *text, size_t l) const
 Obsolete method for converting to UTF8.
 
std::string ToSafeUTF8 (const char *text, size_t l) const
 Convert text to UTF-8 that is safe to print to the console.
 
std::string ToSafeUTF8 (const std::string &text) const
 
std::string CaseFoldedUTF8 (const char *text, size_t l) const
 Convert text into a form suitable for case-insensitive matching.
 
std::string CaseFoldedUTF8 (const std::string &text) const
 
bool IsISO2022 () const
 Returns true if ISO 2022 escape codes are used.
 
bool IsISO8859 () const
 Returns true if this uses an ISO 8859 code page.
 
bool IsBiDirectional () const
 Check for bidirectional character sets.
 
unsigned int CountBackslashes (const char *text, size_t l) const
 Count the number of backslashes in an encoded string.
 
size_t NextBackslash (const char *text, const char *end) const
 Get the offset to the next backslash, or to the end of the string.
 
bool operator== (vtkDICOMCharacterSet b) const
 
bool operator!= (vtkDICOMCharacterSet b) const
 
bool operator<= (vtkDICOMCharacterSet a) const
 
bool operator>= (vtkDICOMCharacterSet a) const
 
bool operator< (vtkDICOMCharacterSet a) const
 
bool operator> (vtkDICOMCharacterSet a) const
 

Static Public Member Functions

static void SetGlobalDefault (vtkDICOMCharacterSet cs)
 Set the character set to use if SpecificCharacterSet is missing.
 
static vtkDICOMCharacterSet GetGlobalDefault ()
 
static void SetGlobalOverride (bool b)
 Override the value stored in SpecificCharacterSet with the default.
 
static void GlobalOverrideOn ()
 
static void GlobalOverrideOff ()
 
static bool GetGlobalOverride ()
 

Detailed Description

Character sets.

This class provides the means to convert the various international text encodings used by DICOM to UTF-8 and back again.

During conversion to UTF-8, any codes in the original encoding that can't be converted are replaced by Unicode's "REPLACEMENT CHARACTER" (U+FFFD), which is usually rendered as a question mark in a black diamond. For instance, if the original encoding is ISO_IR_6 (ASCII), any octets outside of the valid ASCII range of 0 to 127 will become "REPLACEMENT CHARACTER".
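The replacement rule for ISO_IR_6 can be illustrated with a minimal stand-alone sketch. The helper name `AsciiToUTF8Sketch` is hypothetical and is not part of vtk-dicom; it only mimics the behavior described above for the ASCII case.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a vtk-dicom function): decode ISO_IR_6 (ASCII)
// octets to UTF-8, replacing every octet above 127 with U+FFFD.
std::string AsciiToUTF8Sketch(const char *text, std::size_t l)
{
  // U+FFFD REPLACEMENT CHARACTER, encoded as three UTF-8 bytes.
  static const char replacement[] = "\xEF\xBF\xBD";
  std::string out;
  out.reserve(l);
  for (std::size_t i = 0; i < l; i++)
  {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x80)
    {
      // Valid ASCII is already valid UTF-8.
      out.push_back(static_cast<char>(c));
    }
    else
    {
      // Octet outside 0..127: not convertible from ASCII.
      out += replacement;
    }
  }
  return out;
}
```

In real code the same result would presumably come from calling ToUTF8() on a default-constructed vtkDICOMCharacterSet.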

DICOM supports a fairly small number of single-byte and multi-byte character sets. The only VRs that support these character sets are PN, LO, SH, ST, LT, and UT (all other text VRs must be ASCII). In addition to ASCII, there are eleven 8-bit single-byte encodings, three ISO 2022 multi-byte encodings, and three variable-length encodings (UTF-8, GB18030, GBK).

In some DICOM data sets, especially old ones, the SpecificCharacterSet attribute will be missing and it might be necessary to manually specify a character set for the application to use. Use SetGlobalDefault() to do so. The vtkDICOMCharacterSet constructor can take the desired character encoding as a string, where the following encodings are allowed: 'ascii', 'latin1', 'latin2', 'latin3', 'latin4', 'latin5', 'latin7', 'latin9', 'cyrillic' (iso-8859-5), 'arabic' (iso-8859-6), 'greek' (iso-8859-7), 'hebrew' (iso-8859-8), 'tis-620', 'shift-jis', 'euc-jp', 'iso-2022-jp', 'korean' (euc-kr), 'chinese' (gbk), 'gb18030', 'big5', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 'cp1257', and 'utf-8'. Common aliases of these character sets can also be used.

Constructor & Destructor Documentation

◆ vtkDICOMCharacterSet() [1/2]

vtkDICOMCharacterSet::vtkDICOMCharacterSet ( int  k)
inline

Construct a character set object from a given code.

The code can be any of the enumerated code values. The ISO_2022 flag can be added to any of the ISO-8859 codes to indicate that the character set allows the use of escape codes. Also note that ISO_2022_IR_87 and ISO_2022_IR_159 are combining codes that can be added to each other and to ISO_IR_13. Specifying any other codes in combination can lead to undefined results. For example, "ISO_2022_IR_100 | ISO_2022_IR_101" is not permitted, and "ISO_2022_IR_100" must be used instead.
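The arithmetic behind these combinations can be checked with a small stand-alone sketch. The constants are copied from the EnumType listing; treating the value 32 (ISO_2022_IR_6) as the ISO_2022 flag is an assumption drawn from the pattern of the enum values, and `WithISO2022` and `JapaneseAll` are hypothetical names used only for illustration.

```cpp
// Values copied from the EnumType listing. This is a stand-alone sketch,
// not the library header.
enum : int
{
  ISO_IR_13 = 1,
  ISO_IR_100 = 8,
  ISO_2022_IR_6 = 32, // assumed to be the ISO 2022 flag bit on its own
  ISO_2022_IR_13 = 33,
  ISO_2022_IR_87 = 34,
  ISO_2022_IR_159 = 36,
  ISO_2022_IR_100 = 40
};

// Hypothetical helper: add the ISO 2022 flag to a single-byte code.
constexpr int WithISO2022(int code)
{
  return code | ISO_2022_IR_6;
}

// Adding the flag to an ISO-8859 code gives the matching ISO_2022 code.
static_assert(WithISO2022(ISO_IR_100) == ISO_2022_IR_100, "latin1 + flag");
static_assert(WithISO2022(ISO_IR_13) == ISO_2022_IR_13, "IR 13 + flag");

// The Japanese combining codes can be OR'ed with each other and ISO_IR_13.
constexpr int JapaneseAll = ISO_2022_IR_87 | ISO_2022_IR_159 | ISO_IR_13;
static_assert(JapaneseAll == 39, "34 | 36 | 1");
```

The resulting integer would then be passed to the `vtkDICOMCharacterSet(int k)` constructor.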

◆ vtkDICOMCharacterSet() [2/2]

vtkDICOMCharacterSet::vtkDICOMCharacterSet ( const std::string &  name)
inline explicit

Construct a character set object from a SpecificCharacterSet value.

This generates an 8-bit code that uniquely identifies a DICOM character set plus its code extensions.

Member Function Documentation

◆ CaseFoldedUTF8()

std::string vtkDICOMCharacterSet::CaseFoldedUTF8 ( const char *  text,
size_t  l 
) const

Convert text into a form suitable for case-insensitive matching.

This function will perform case normalization on a string by converting it to lowercase, and by normalizing the forms of lowercase characters that do not have an exact uppercase equivalent. In some cases, it might increase the length of the string. It covers modern European scripts (including Greek and Cyrillic) and Latin characters used in East Asian languages.
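As a rough illustration of what case folding involves, the sketch below lowercases ASCII letters and applies two mappings from Unicode case folding: "ß" becomes "ss" and final sigma "ς" becomes "σ". Whether CaseFoldedUTF8() uses exactly these mappings is an assumption; `CaseFoldSketch` is a hypothetical helper that handles only these few cases and passes all other bytes through unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a vtk-dicom function): fold a UTF-8 string
// for case-insensitive comparison, covering only ASCII, U+00DF, U+03C2.
std::string CaseFoldSketch(const std::string &text)
{
  std::string out;
  for (std::size_t i = 0; i < text.size(); i++)
  {
    unsigned char c = static_cast<unsigned char>(text[i]);
    // Peek at the next byte (0 if there is none).
    unsigned char d = (i + 1 < text.size())
      ? static_cast<unsigned char>(text[i + 1]) : 0;
    if (c >= 'A' && c <= 'Z')
    {
      // ASCII uppercase to lowercase.
      out.push_back(static_cast<char>(c + ('a' - 'A')));
    }
    else if (c == 0xC3 && d == 0x9F)
    {
      // U+00DF (sharp s) has no single-letter uppercase: fold to "ss".
      out += "ss";
      i++;
    }
    else if (c == 0xCF && d == 0x82)
    {
      // U+03C2 (final sigma) folds to U+03C3 (sigma).
      out += "\xCF\x83";
      i++;
    }
    else
    {
      out.push_back(text[i]);
    }
  }
  return out;
}
```

With this sketch, "STRAßE" and "strasse" fold to the same string, so a comparison of the folded forms treats them as equal.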

◆ CountBackslashes()

unsigned int vtkDICOMCharacterSet::CountBackslashes ( const char *  text,
size_t  l 
) const

Count the number of backslashes in an encoded string.

The backslash byte is sometimes present as half of a multibyte character in the Japanese and Chinese encodings. This method skips these false backslashes and counts only real backslashes.
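The false-backslash problem can be sketched for Shift-JIS, where the katakana "ソ" is encoded as the bytes 0x83 0x5C. The helpers below are hypothetical and handle Shift-JIS only; they are not the library's implementation, which must cover several encodings.

```cpp
#include <cstddef>
#include <string>

// True if c is the first byte of a two-byte Shift-JIS character.
static bool IsSJISLeadByte(unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Hypothetical helper: count only the backslashes that are real
// separators in Shift-JIS encoded text.
unsigned int CountBackslashesSJISSketch(const char *text, std::size_t l)
{
  unsigned int count = 0;
  for (std::size_t i = 0; i < l; i++)
  {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (IsSJISLeadByte(c))
    {
      // Skip the trail byte, which may be 0x5C without being a backslash.
      i++;
    }
    else if (c == '\\')
    {
      count++;
    }
  }
  return count;
}
```

A naive byte search over the sequence 0x83 0x5C 0x5C would report two backslashes, whereas this sketch reports one.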

◆ FromUTF8()

std::string vtkDICOMCharacterSet::FromUTF8 ( const char *  text,
size_t  l,
size_t *  lp = 0 
) const

Convert text from UTF-8 to this encoding.

Attempt to convert from UTF-8 to this character set. Every non-convertible character will be replaced with '?'. If you pass a non-null value for the "lp" parameter, the position in the input UTF-8 string where the first conversion error occurred will be stored in *lp. If the conversion succeeds without errors, *lp will be set to the length of the input string.
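The "lp" convention can be illustrated with a stand-alone sketch that targets ASCII only. `UTF8ToAsciiSketch` is a hypothetical helper, not the library's FromUTF8(); it assumes that each non-ASCII UTF-8 sequence counts as one non-convertible character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-8 to ASCII, writing '?' for each
// character that has no ASCII equivalent. If lp is non-null, it receives
// the offset of the first such character, or l if there was none.
std::string UTF8ToAsciiSketch(
  const char *text, std::size_t l, std::size_t *lp = nullptr)
{
  std::string out;
  std::size_t firstError = l; // stays at l when everything converts
  std::size_t i = 0;
  while (i < l)
  {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x80)
    {
      out.push_back(static_cast<char>(c));
      i++;
      continue;
    }
    // Non-ASCII character: remember where the first one started.
    if (firstError == l)
    {
      firstError = i;
    }
    out.push_back('?');
    i++;
    // Skip the continuation bytes (10xxxxxx) of this character.
    while (i < l && (static_cast<unsigned char>(text[i]) & 0xC0) == 0x80)
    {
      i++;
    }
  }
  if (lp)
  {
    *lp = firstError;
  }
  return out;
}
```

A caller can compare the value stored through "lp" with the input length to find out whether the conversion was lossless.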

◆ GetCharacterSetString()

std::string vtkDICOMCharacterSet::GetCharacterSetString ( ) const

Generate SpecificCharacterSet code values (diagnostic only).

Attempt to generate SpecificCharacterSet code values. If ISO 2022 encoding is not used, then a single code value is returned. If ISO 2022 encoding is used with the single-byte character sets, then only the code value for the first character set will be returned (due to limitations in the way this class stores the information). However, if ISO 2022 encoding is used with the multi-byte character sets, the result is a set of backslash-separated code values, where the first value will be empty if the initial coding is ASCII.

◆ IsBiDirectional()

bool vtkDICOMCharacterSet::IsBiDirectional ( ) const
inline

Check for bidirectional character sets.

This is used to check for character sets that are likely to contain characters that print right-to-left, specifically Hebrew and Arabic. Note that even though some parts of Unicode fall into this category, this flag is off for Unicode and GB18030/GBK.

◆ IsISO2022()

bool vtkDICOMCharacterSet::IsISO2022 ( ) const
inline

Returns true if ISO 2022 escape codes are used.

If this method returns true, then escape codes can be used to switch between character sets.

◆ NextBackslash()

size_t vtkDICOMCharacterSet::NextBackslash ( const char *  text,
const char *  end 
) const

Get the offset to the next backslash, or to the end of the string.

In order to work properly, this method requires that its input is either at the beginning of the string or just after a backslash.
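The calling pattern implied by this precondition can be sketched as a loop that splits a multi-valued string. Both functions below are hypothetical stand-ins that assume a single-byte encoding in which every 0x5C byte is a real backslash; the library method also has to handle multi-byte encodings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for NextBackslash() in a single-byte encoding:
// return the offset from text to the next backslash, or to end.
std::size_t NextBackslashSketch(const char *text, const char *end)
{
  const char *cp = text;
  while (cp != end && *cp != '\\')
  {
    cp++;
  }
  return static_cast<std::size_t>(cp - text);
}

// Split a backslash-separated string into its values.
std::vector<std::string> SplitValuesSketch(const std::string &s)
{
  std::vector<std::string> values;
  const char *cp = s.data();
  const char *end = cp + s.size();
  for (;;)
  {
    // cp is at the start of the string or just after a backslash.
    std::size_t n = NextBackslashSketch(cp, end);
    values.emplace_back(cp, n);
    cp += n;
    if (cp == end)
    {
      break;
    }
    cp++; // step over the backslash so the next call starts after it
  }
  return values;
}
```

Each call starts either at the beginning of the string or immediately after a backslash, which is the condition the method requires.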

◆ SetGlobalDefault()

static void vtkDICOMCharacterSet::SetGlobalDefault ( vtkDICOMCharacterSet  cs)
inline static

Set the character set to use if SpecificCharacterSet is missing.

Some DICOM files do not list a SpecificCharacterSet attribute, but nevertheless use a non-ASCII character encoding. This method can be used to specify the character set in the absence of SpecificCharacterSet. If SpecificCharacterSet is present, the default will not override it unless OverrideCharacterSet is true.

◆ SetGlobalOverride()

static void vtkDICOMCharacterSet::SetGlobalOverride ( bool  b)
inline static

Override the value stored in SpecificCharacterSet with the default.

This method can be used if the SpecificCharacterSet attribute of a file is incorrect. It forces the use of the character set that was set with SetGlobalDefault.

◆ ToSafeUTF8()

std::string vtkDICOMCharacterSet::ToSafeUTF8 ( const char *  text,
size_t  l 
) const

Convert text to UTF-8 that is safe to print to the console.

All control characters or unconvertible characters will be replaced by four-byte octal codes, e.g. '\033'. Backslashes will be replaced by '\134' to avoid any potential ambiguity.
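The escaping scheme can be sketched for ASCII input as follows. `SafeAsciiSketch` is a hypothetical helper, not the library's ToSafeUTF8(); treating every octet above 127 as unconvertible is an assumption that holds only when the source encoding is ASCII.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: make ASCII text safe to print by replacing control
// characters, backslashes, and non-ASCII octets with octal escapes.
std::string SafeAsciiSketch(const std::string &text)
{
  std::string out;
  for (std::size_t i = 0; i < text.size(); i++)
  {
    unsigned char c = static_cast<unsigned char>(text[i]);
    if (c < 0x20 || c == 0x7F || c == '\\' || c >= 0x80)
    {
      // Backslash plus three octal digits, e.g. "\033" for ESC.
      char buf[5];
      std::snprintf(buf, sizeof(buf), "\\%03o", static_cast<unsigned int>(c));
      out += buf;
    }
    else
    {
      out.push_back(static_cast<char>(c));
    }
  }
  return out;
}
```

Because real backslashes are themselves written as '\134', every backslash in the output is guaranteed to begin an octal escape.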

◆ ToUTF8()

std::string vtkDICOMCharacterSet::ToUTF8 ( const char *  text,
size_t  l,
size_t *  lp = 0 
) const

Convert text from this encoding to UTF-8.

This will convert text to UTF-8, which is generally a lossless process for western languages but not for the CJK languages. Characters that cannot be mapped to Unicode, or whose place in Unicode is not known, will be replaced by U+FFFD, which appears as a question mark in a diamond. If you pass a non-null value for the "lp" parameter, the position in the input string where the first conversion error occurred will be stored in *lp. If the conversion succeeds without errors, *lp will be set to the length of the input string.


The documentation for this class was generated from the following file: