Wednesday, March 30, 2011

Setting data type when reading XML data in SAS

I need to control the data type when reading XML data in SAS. The XML data are written and accessed using the XML libname engine in SAS.

SAS seems to guess the data type based on the contents of a column: If I write "20081002" to my XML data in a character column, it will be read back in as a numerical variable.

An example:

filename my_xml '/tmp/my.xml'; * Yes, I use SAS on Unix *;
libname my_xml XML;

data my_xml.data_type_test;
  text_char="This is obviously text";
  date_char="20081002";
  num_char="42";
  genuine_num=42;
run;

proc copy inlib=my_xml outlib=WORK;
run;

libname my_xml;
filename my_xml CLEAR;

Only the last column is defined as numerical data type in the XML data, but when I copy it into my WORK library, only the column *text_char* is character. The other 3 are now numeric.

How can I control the data type when reading XML data in SAS?

From stackoverflow
  • I think you need to define some xml specific options whith your libname XML statement for export go:

    libname my_xml_out XML XMLMETA=SCHEMADATA;
    

    To include the data schema. Also, you might want to save the XML schema to a separate file for later import:

    libname my_xml_in XML XMLSCHEMA='external-file'
    

    after you exported the schema using XMLMETA=SCHEMA of course. I think this is the documentation you need.

    Apart from that liberal use of format statements on original dataset creation is recommended.

    Martin Bøgelund : Thanks for the input. Unfortunately, I can't get it to work - SAS still claims that my character data that look like numbers, are of the NUM data type, when read from XML using XMLSCHEMA.
  • Take a look at the SAS XML Mapper. It allows you to create a map to read (and wrte in 9.2) XML files and specifying column attributes.

    If this is your XML file:

    This is obviously text 20081002 42 42

    You could create a MAP like this:

    <!-- ############################################################ -->
    <TABLE name="DATA_TYPE_TEST">
        <TABLE-PATH syntax="XPath">/TABLE/DATA_TYPE_TEST</TABLE-PATH>
    
        <COLUMN name="text_char">
            <PATH syntax="XPath">/TABLE/DATA_TYPE_TEST/text_char</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>22</LENGTH>
        </COLUMN>
    
        <COLUMN name="date_char">
            <PATH syntax="XPath">/TABLE/DATA_TYPE_TEST/date_char</PATH>
            <TYPE>numeric</TYPE>
            <DATATYPE>integer</DATATYPE>
            <FORMAT width="9">DATE</FORMAT>
            <INFORMAT width="8">ND8601DA</INFORMAT>
        </COLUMN>
    
        <COLUMN name="num_char">
            <PATH syntax="XPath">/TABLE/DATA_TYPE_TEST/num_char</PATH>
            <TYPE>character</TYPE>
            <DATATYPE>string</DATATYPE>
            <LENGTH>2</LENGTH>
        </COLUMN>
    
        <COLUMN name="genuine_num">
            <PATH syntax="XPath">/TABLE/DATA_TYPE_TEST/genuine_num</PATH>
            <TYPE>numeric</TYPE>
            <DATATYPE>integer</DATATYPE>
        </COLUMN>
    
    </TABLE>
    

    And then read the XML file:

    filename  my 'C:\temp\my.xml';
    filename  SXLEMAP 'C:\temp\MyMap.map';
    libname   my xml xmlmap=SXLEMAP access=READONLY;
    
    title 'Table DATA_TYPE_TEST';
    proc contents data=my.DATA_TYPE_TEST varnum; 
    run;
    proc print data=my.DATA_TYPE_TEST(obs=10); 
    run;
    

    Result:

    Table DATA_TYPE_TEST
    
    The CONTENTS Procedure
    
    Data Set Name        MY.DATA_TYPE_TEST    Observations            
    Member Type          DATA                 Variables             4 
    Engine               XML                  Indexes               0 
    Created              .                    Observation Length    0 
    Last Modified        .                    Deleted Observations  0 
    Protection                                Compressed            NO
    Data Set Type                             Sorted                NO
    Label                                                             
    Data Representation  Default                                      
    Encoding             Default                                      
    
    
    Variables in Creation Order
    
    #    Variable       Type    Len    Format    Informat      Label
    
    1    text_char      Char     22    $22.      $22.          text_char  
    2    date_char      Num       8    DATE9.    ND8601DA8.    date_char  
    3    num_char       Char      2    $2.       $2.           num_char   
    4    genuine_num    Num       8    F8.       F8.           genuine_num
    
    Table DATA_TYPE_TEST
    
                                                                   genuine_
         Obs    text_char                 date_char    num_char      num
    
           1    This is obviously text    02OCT2008       42             42
    

0 comments:

Post a Comment