Friday, January 6, 2017

Python - Pandas - How to read compressed SAS data


Pandas provides the function read_sas to read SAS data. It supports two formats: (1) ‘xport’ and (2) ‘sas7bdat’. Sometimes the data is really large and is provided as a compressed file. Unfortunately, it seems that pandas does not support reading compressed SAS data directly.


In this post, we introduce a not-quite-safe way to read zipped SAS data. To do so, we need three packages:

  1. gzip
  2. io
  3. pandas
If we try the following code, it does not work. The reason is that the decompressed stream does not start with the SAS data: a header holding meta information about the original file and the system comes before the actual data (for a .tar.gz file, this is presumably the tar archive header). So when we pass the file object to pd.read_sas, the function complains that the file is not a SAS file.

f = gzip.GzipFile('your_input_file.sas7bdat.tar.gz', 'rb')
df = pd.read_sas(f, format='sas7bdat')
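
To make the failure concrete, here is a minimal sketch using only the standard library. It builds a small .tar.gz in memory (the member name and the payload bytes are made up for illustration, and the USTAR format is chosen explicitly as an assumption) and shows that decompressing it with gzip alone yields a 512-byte block of tar metadata before the payload.

```python
import gzip
import io
import tarfile


def make_targz(member_name, payload):
    """Build a tiny .tar.gz archive in memory (illustrative helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz",
                      format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Stand-in bytes, not a real SAS file.
payload = b"FAKE-SAS-BYTES"
archive = make_targz("data.sas7bdat", payload)

# gzip removes only the gzip layer; the tar layer is still there.
with gzip.GzipFile(fileobj=io.BytesIO(archive), mode="rb") as f:
    header = f.read(512)          # first tar block: metadata, not data
    body = f.read(len(payload))   # the member's bytes start after it

member_name = header[:100].rstrip(b"\0")  # name field of the tar header
magic = header[257:262]                   # tar magic string

print(member_name, magic, body)
```

A reader that expects SAS magic bytes at position 0 sees the tar metadata instead, which would explain the "not a SAS file" complaint.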


So how do we address this problem? There is a very straightforward way to handle it, but it is not very safe (this point is explained later). The simple idea is to skip the meta information header in the gz file.

Here is the code. In part 1, we override the seek function of gzip.GzipFile. The new function accounts for the skipped header: whenever pd.read_sas calls seek, an additional offset (the length of the header) is added to the requested offset. In part 2, we simply guess where the meta header information ends, trying one offset after another until pd.read_sas accepts the stream.

As mentioned previously, this method is not safe because the overridden seek function does not take the whence parameter into account: every call is treated as an absolute seek. Fortunately, this code can work (at least for some small files).
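
For comparison, a seek override that does honor whence could look like the sketch below. This is an assumption about what a safer variant might be, not the code used in this post: the class name OffsetGzipFile is hypothetical, and whether pd.read_sas ever seeks with a non-default whence has not been checked here. The demo data is a made-up byte string, not a SAS file.

```python
import gzip
import io


class OffsetGzipFile(gzip.GzipFile):
    """GzipFile that hides the first `_gz_offset` decompressed bytes."""

    _gz_offset = 0  # nothing is hidden until set_gz_offset is called

    def set_gz_offset(self, val):
        self._gz_offset = val

    def seek(self, offset, whence=io.SEEK_SET):
        # Only absolute seeks need shifting; relative seeks (SEEK_CUR)
        # and end-relative seeks (SEEK_END) already land correctly.
        if whence == io.SEEK_SET:
            offset += self._gz_offset
        pos = super().seek(offset, whence)
        # Report positions relative to the hidden header.
        return pos - self._gz_offset

    def tell(self):
        # Route through seek so the reported position is shifted too.
        return self.seek(0, io.SEEK_CUR)


# Small in-memory demo: 6 header bytes followed by the payload.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as g:
    g.write(b"HEADER" + b"payload-bytes")

demo = OffsetGzipFile(fileobj=io.BytesIO(raw.getvalue()), mode="rb")
demo.set_gz_offset(6)
demo.seek(0)
first = demo.read(7)   # bytes right after the hidden header
where = demo.tell()    # position as seen by the caller
demo.close()
```

With this shape, the caller sees a stream that appears to start at the payload regardless of which whence value it uses.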


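import gzip
import io
import pandas as pd

# part 1: change the "seek" behavior of the file object.
class FileObjGZ(gzip.GzipFile):
    _gz_offset = 0  # default until set_gz_offset is called

    def set_gz_offset(self, val):
        self._gz_offset = val

    def seek(self, offset, whence=io.SEEK_SET):
        # whence is ignored: every seek is treated as absolute
        new_offset = offset + self._gz_offset
        return super(FileObjGZ, self).seek(new_offset)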


input_file = 'your_input_file.sas7bdat.tar.gz'

# part 2: try to skip the meta header
with FileObjGZ(input_file, 'rb') as f:
    guess_gz_offset = 0
    while True:
        try:
            f.set_gz_offset(guess_gz_offset)
            f.seek(0)
            df = pd.read_sas(f, format='sas7bdat')
            break
        except ValueError as e:
            print(e)
            guess_gz_offset += 1
    print("loading data is completed.")
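
If the input really is a .tar.gz archive, as the file name suggests, an alternative that avoids guessing the offset is to let the standard tarfile module locate the member. The sketch below is only an illustration under that assumption: the helper names open_sas_member and read_sas_from_targz are hypothetical, and the whole member is copied into memory, which may not be practical for very large files.

```python
import io
import os
import tarfile


def open_sas_member(archive, suffix=".sas7bdat"):
    """Return a seekable in-memory copy of the first member ending in suffix.

    `archive` may be a path or an already-open binary file object.
    """
    if isinstance(archive, (str, os.PathLike)):
        kwargs = {"name": archive}
    else:
        kwargs = {"fileobj": archive}
    with tarfile.open(mode="r:gz", **kwargs) as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                # extractfile returns a reader positioned at the data,
                # so the tar header never reaches the caller.
                return io.BytesIO(tar.extractfile(member).read())
    raise FileNotFoundError("no member ending in %r found" % suffix)


def read_sas_from_targz(archive, suffix=".sas7bdat"):
    """Read the SAS member of a .tar.gz archive into a DataFrame."""
    # Imported here so open_sas_member stays usable without pandas.
    import pandas as pd
    return pd.read_sas(open_sas_member(archive, suffix), format="sas7bdat")
```

Usage would then be df = read_sas_from_targz('your_input_file.sas7bdat.tar.gz'), with no overridden seek and no trial-and-error loop.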


