Friday, April 8, 2011

How to best write out a std::vector < std::string > container to an HDF5 dataset?

Given a vector of strings, what is the best way to write them out to an HDF5 dataset? At the moment I'm doing something like the following:

  const unsigned int MaxStrLength = 512;

  struct TempContainer {
    char string[MaxStrLength];
  };

  void writeVector (hid_t group, std::vector<std::string> const & v)
  {
    //
    // Firstly copy the contents of the vector into a temporary container
    std::vector<TempContainer> tc;
    for (std::vector<std::string>::const_iterator i = v.begin ()
                                              , end = v.end ()
      ; i != end
      ; ++i)
    {
      TempContainer t = {};  // zero-fill so the copy is always NUL-terminated
      strncpy (t.string, i->c_str (), MaxStrLength - 1);
      tc.push_back (t);
    }


    //
    // Write the temporary container to a dataset
    hsize_t     dims[] = { tc.size () } ;
    hid_t dataspace = H5Screate_simple(sizeof(dims)/sizeof(*dims)
                               , dims
                               , NULL);

    hid_t strtype = H5Tcopy (H5T_C_S1);
    H5Tset_size (strtype, MaxStrLength);

    hid_t datatype = H5Tcreate (H5T_COMPOUND, sizeof (TempContainer));
    H5Tinsert (datatype
      , "string"
      , HOFFSET(TempContainer, string)
      , strtype);

    hid_t dataset = H5Dcreate1 (group
                          , "files"
                          , datatype
                          , dataspace
                          , H5P_DEFAULT);

    H5Dwrite (dataset, datatype, H5S_ALL, H5S_ALL, H5P_DEFAULT, &tc[0] );

    H5Dclose (dataset);
    H5Sclose (dataspace);
    H5Tclose (strtype);
    H5Tclose (datatype);
}

At a minimum, I would really like to change the above so that:

  1. It uses variable length strings
  2. I don't need to have a temporary container

I have no restrictions over how I store the data so for example, it doesn't have to be a COMPOUND datatype if there is a better way to do this.

EDIT: Just to narrow the problem down, I'm relatively familiar with playing with the data on the C++ side, it's the HDF5 side where I need most of the help.

Thanks for your help.

From stackoverflow
  • If you are looking at cleaner code: I suggest you create a functor that'll take a string and save it to the HDF5 Container (in a desired mode). Richard, I used the wrong algorithm, please re-check!

    // write_hdf5 is an instance of the hdf5 functor below
    std::for_each(v.begin(), v.end(), write_hdf5);
    
    struct hdf5 : public std::unary_function<std::string, void> {
        hdf5() : _dataset(...) {} // initialize the HDF5 db
        ~hdf5() {} // close the HDF5 db
        void operator()(const std::string& s) {
                // append 
                // use s.c_str() ?
        }
    };
    

    Does that help get started?

    Richard Corden : Well - yes I'm hoping to be able to reach this kind of style - however, I wasn't sure if it was (a) possible and (b) efficient. Thanks for the answer.
    Richard Corden : I'm really very new to HDF5, so I have no idea what needs to be written where you have "// append".
    dirkgently : I've only heard of HDF5. By "append" I meant whatever you are doing under the comment // Write the temporary container to a dataset.
    Richard Corden : And this is the crux of the problem. The "H5Dwrite" method takes a 'void*' argument and writes that, it's a bit like "memcpy" or "memmove" where you give it a size and a block of data. At least that's what I currently think! :)
    dirkgently : So use your_data_string.c_str() and your_data_string.size(). void* is really a way of letting any kind of data pass-through. I wonder why you need the struct TempContainer though.
    Richard Corden : Ok. The key question is: What are the parameters to H5Dwrite so that it will only append a single string into the next slot of the dataset? Then, as you say I can use a .c_str method. But currently I am having to first create a C style container with my strings which is dumped in one go.
    dirkgently : What I am suggesting is, instead of creating a single huge string and a single HDFwrite() call, can't you do the opposite i.e. write many strings using multiple HDFwrite() calls?
    Richard Corden : So we're asking the same question now! ;) All of the examples for H5Dwrite that I've found appear to write the entire "dataset" in one go, not entry by entry.
    dirkgently : From what I can see here -- http://www.hdfgroup.org/HDF5/doc/RM/RM_H5D.html#Dataset-Write -- this approach should work.
    dirkgently : 'HDFwrite() writes a partial dataset at a time.'
    Richard Corden : Thanks for helping with this. One of the issues is that I do not know how to write a partial dataset yet, nor have I been able to find examples that use anything other than H5S_ALL for the dataspace. Re: HDFwrite, it appears to be matlab specific? At least it is not in the HDF5 library I have.
    dirkgently : No, it's there in the C library as well. Have you seen this already: http://www.hdfgroup.org/HDF5/doc/Intro/IntroExamples.html#CheckAndReadExample ?
    Richard Corden : I hadn't seen this - I'll try it out. Thanks for this.
    dirkgently : Also this: Reading and Writing a portion of a dataset somewhere nearly halfway through this page http://www.hdfgroup.org/HDF5/doc/H5.intro.html#Intro-WhatIs.
  • I don't know about HDF5, but you can use

    struct TempContainer {
        char* string;
    };
    

    and then copy the strings this way:

    TempContainer t;
    t.string = strdup(i->c_str());
    tc.push_back (t);
    

    This allocates each string with its exact size, and it also makes inserting into and reading from the container much cheaper (in your example a whole array is copied, in this case only a pointer). You can also use a std::vector<char *> directly:

    std::vector<char *> tc;
    ...
    tc.push_back(strdup(i->c_str()));
    
    Richard Corden : Sure. Ideally I wouldn't need the temporary container at all. This code adds the slight disadvantage that the memory needs to be freed explicitly.
  • Instead of a TempContainer, you can use a simple std::vector<char> (you could also templatize it on T to match basic_string<T>). Something like this:

    #include <algorithm>
    #include <cstring>
    #include <vector>
    #include <string>
    #include <functional>
    
    class StringToVector
      : std::unary_function<std::string, std::vector<char> > {
    public:
      std::vector<char> operator()(const std::string &s) const {
        // assumes you want a NUL-terminated string
        const char* str = s.c_str();
        // strlen() stops at the first NUL, so it can be shorter than
        // s.size() if s contains embedded NULs
        std::size_t size = 1 + std::strlen(str);
        std::vector<char> buf(&str[0], &str[size]);
        return buf;
      }
    };
    
    void conv(const std::vector<std::string> &vi,
              std::vector<std::vector<char> > &vo)
    {
      // assert vo.size() == vi.size()
      std::transform(vi.begin(), vi.end(),
                     vo.begin(),
                     StringToVector());
    }
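
    The conv function above writes through vo.begin(), so vo must already hold vi.size() elements. A minimal sketch of a variant that appends instead, using std::back_inserter, might look like the following. The names toBuffer and convAppend are hypothetical, and unlike the strlen-based version this one copies s.size() + 1 characters, so embedded NULs are kept.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <iterator>
    #include <string>
    #include <vector>

    // Hypothetical helper: copy a string, including its terminating NUL,
    // into a char buffer. c_str() guarantees s.size() + 1 readable chars.
    std::vector<char> toBuffer(const std::string &s)
    {
      const char *str = s.c_str();
      std::vector<char> buf(str, str + s.size() + 1);
      assert(buf.back() == '\0');
      return buf;
    }

    // Hypothetical variant of conv: appends one buffer per input string,
    // so 'vo' does not need to be pre-sized by the caller.
    void convAppend(const std::vector<std::string> &vi,
                    std::vector<std::vector<char> > &vo)
    {
      vo.reserve(vo.size() + vi.size());
      std::transform(vi.begin(), vi.end(),
                     std::back_inserter(vo),
                     toBuffer);
    }
    ```

    Calling convAppend(v, out) with an empty out leaves out.size() == v.size(), which removes the need for the size assertion in conv.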
    
  • [Many thanks to dirkgently for his help in answering this.]

    To write a variable length string in HDF5 use the following:

    // Create the datatype as follows
    hid_t datatype = H5Tcopy (H5T_C_S1);
    H5Tset_size (datatype, H5T_VARIABLE);
    
    // 
    // Pass the string to be written to H5Dwrite
    // using the address of the pointer!
    const char * s = v.c_str ();
    H5Dwrite (dataset
      , datatype
      , H5S_ALL
      , H5S_ALL
      , H5P_DEFAULT
      , &s );
    

    One solution for writing a container is to write each element individually. This can be achieved using hyperslabs.

    For example:

    class WriteString
    {
    public:
      WriteString (hid_t dataset, hid_t datatype
          , hid_t dataspace, hid_t memspace)
        : m_dataset (dataset), m_datatype (datatype)
        , m_dataspace (dataspace), m_memspace (memspace)
        , m_pos () {}
    
    private:
      hid_t m_dataset;
      hid_t m_datatype;
      hid_t m_dataspace;
      hid_t m_memspace;
      hsize_t m_pos;
    

    //...

    public:
      void operator ()(std::vector<std::string>::value_type const & v)
      {
        // Select the file position, 1 record at position 'pos'
        hsize_t count[] = { 1 } ;
        hsize_t offset[] = { m_pos++ } ;
        H5Sselect_hyperslab( m_dataspace
          , H5S_SELECT_SET
          , offset
          , NULL
          , count
          , NULL );
    
        const char * s = v.c_str ();
        H5Dwrite (m_dataset
          , m_datatype
          , m_memspace
          , m_dataspace
          , H5P_DEFAULT
          , &s );
      }
    };
    

    // ...

    void writeVector (hid_t group, std::vector<std::string> const & v)
    {
      hsize_t     dims[] = { v.size ()  } ;
      hid_t dataspace = H5Screate_simple(sizeof(dims)/sizeof(*dims)
                                        , dims, NULL);
    
      dims[0] = 1;
      hid_t memspace = H5Screate_simple(sizeof(dims)/sizeof(*dims)
                                        , dims, NULL);
    
      hid_t datatype = H5Tcopy (H5T_C_S1);
      H5Tset_size (datatype, H5T_VARIABLE);
    
      hid_t dataset = H5Dcreate1 (group, "files", datatype
                                 , dataspace, H5P_DEFAULT);
    
      // 
      // Select the "memory" to be written out - just 1 record.
      hsize_t offset[] = { 0 } ;
      hsize_t count[] = { 1 } ;
      H5Sselect_hyperslab( memspace, H5S_SELECT_SET, offset
                         , NULL, count, NULL );
    
      std::for_each (v.begin ()
          , v.end ()
          , WriteString (dataset, datatype, dataspace, memspace));
    
      H5Dclose (dataset);
      H5Sclose (dataspace);
      H5Sclose (memspace);
      H5Tclose (datatype);
    }
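
    As a possible alternative to one H5Dwrite call per element, the H5T_VARIABLE memory type reads an array of char pointers, so the whole vector could in principle be written in a single call. The sketch below only builds that pointer array and needs nothing from HDF5 itself. The helper name borrowPointers is hypothetical, and the H5Dwrite call in the comment is an untested assumption based on the single-string example above.

    ```cpp
    #include <cassert>
    #include <string>
    #include <vector>

    // Hypothetical helper (not part of HDF5): collect one borrowed C-string
    // pointer per element. Nothing is copied or allocated per string, so
    // nothing needs to be freed, but the pointers are only valid while 'v'
    // is alive and unmodified.
    std::vector<const char *> borrowPointers(const std::vector<std::string> &v)
    {
      std::vector<const char *> ptrs;
      ptrs.reserve(v.size());
      for (std::vector<std::string>::const_iterator i = v.begin();
           i != v.end(); ++i)
      {
        ptrs.push_back(i->c_str());
      }
      assert(ptrs.size() == v.size());
      return ptrs;
    }

    // Assumed usage, for a non-empty 'v', with 'dataset' created over
    // v.size() elements and 'datatype' set to H5T_VARIABLE as above:
    //
    //   std::vector<const char *> ptrs = borrowPointers(v);
    //   H5Dwrite(dataset, datatype, H5S_ALL, H5S_ALL, H5P_DEFAULT, &ptrs[0]);
    ```

    This still uses a temporary container, but it holds only one pointer per string rather than a copy of the text.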
    
    dirkgently : You know what? HDF5 is one of those things I've always wanted to read and write. But procrastination being my middle name that hasn't materialized. Thanks to you I've decided to give it a more dedicated shot this time. I'd be very, very interested to know, where you are using this, if possible.
    Richard Corden : We are looking to change how our static analysis tool stores the data that it gathers from its analysis. The data will contain tree like structures (scopes, types etc) and lists of diagnostics. At this stage I'm just evaluating how well HDF5 handles the different types of data.
    Richard Corden : This question (that I asked) outlines the kind of features that we are evaluating for: http://stackoverflow.com/questions/547195/evaluating-hdf5-what-limitations-features-does-hdf5-provide-for-modelling-data
