How to gunzip on the fly with Python

For a long time I wondered how R was able to recognize gzipped files and decompress them on the fly. This is neat because the large data files that we manipulate in bioinformatics are better kept compressed on disk and decompressed only when they are loaded into memory.

Most binary file formats start with a magic number that identifies the file type. A properly gzipped file starts with the two bytes 1F 8B (in hexadecimal). So you read the first two bytes of the file, and depending on whether they match, you either read the file as usual or read it with the functions of the gzip package.
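To make the check concrete, here is a minimal sketch of the magic number test on its own. The helper name is_gzipped and the demo file names are made up for the illustration; the files are created in a temporary directory so that the snippet runs as is.

```python
import gzip
import os
import tempfile

# The two bytes that open every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"

def is_gzipped(path):
    """Return True if the file at 'path' starts with the gzip magic number."""
    # Binary mode, so the bytes are compared without any text decoding.
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# Write the same content once gzipped and once plain (hypothetical names).
tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "demo.txt.gz")
txt_path = os.path.join(tmpdir, "demo.txt")
with gzip.open(gz_path, "wb") as f:
    f.write(b"hello\n")
with open(txt_path, "wb") as f:
    f.write(b"hello\n")

print(is_gzipped(gz_path), is_gzipped(txt_path))  # prints: True False
```

Note that the test looks at the content of the file and not at its name, so a gzipped file without the .gz extension is still recognized.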

Here I wrote a small module called gzopen.py. After importing the class gzopen, you can use it to seamlessly open gzipped files.

# -*- coding:utf-8 -*-

import gzip

class gzopen(object):
    """Generic opener that decompresses gzipped files
    if needed. Encapsulates an open file or a GzipFile.
    Use the same way you would use 'open()' in binary
    mode: under Python 3, reads return bytes.
    """
    def __init__(self, fname):
        # Open in binary mode so that the magic number can
        # be compared byte for byte.
        f = open(fname, 'rb')
        # Read magic number (the first 2 bytes) and rewind.
        magic_number = f.read(2)
        f.seek(0)
        # Encapsulated 'self.f' is a file or a GzipFile.
        if magic_number == b'\x1f\x8b':
            self.f = gzip.GzipFile(fileobj=f)
        else:
            self.f = f

    # Define '__enter__' and '__exit__' to use in
    # 'with' blocks. Always close the file and the
    # GzipFile if applicable.
    def __enter__(self):
        return self
    def __exit__(self, type, value, traceback):
        try:
            self.f.fileobj.close()
        except AttributeError:
            pass
        finally:
            self.f.close()

    # Reproduce the interface of an open file
    # by encapsulation.
    def __getattr__(self, name):
        return getattr(self.f, name)
    def __iter__(self):
        return iter(self.f)
    def __next__(self):
        return next(self.f)
    next = __next__  # Python 2 spelling of the iterator protocol.

If there is a gzipped file called tmp.txt.gz in the same directory as gzopen.py, you can read it the same way you would with open. Below is an example.

from gzopen import gzopen
with gzopen('tmp.txt.gz') as f:
    print(f.read())
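If you would rather get text than raw bytes under Python 3, the same idea can be written as a plain function on top of gzip.open, which accepts a text mode. This is only a sketch of an alternative, not the module above: the name open_text, the UTF-8 default and the demo files are assumptions made for the illustration.

```python
import gzip
import os
import tempfile

def open_text(path, encoding="utf-8"):
    """Open 'path' for reading text, decompressing it if it is gzipped."""
    # Peek at the first two bytes, then reopen with the right opener.
    with open(path, "rb") as probe:
        magic_number = probe.read(2)
    if magic_number == b"\x1f\x8b":
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)

# Write the same two lines as a plain and as a gzipped file
# (hypothetical names, placed in a temporary directory).
tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, "tmp.txt")
zipped = os.path.join(tmpdir, "tmp.txt.gz")
with open(plain, "w", encoding="utf-8") as f:
    f.write("line 1\nline 2\n")
with gzip.open(zipped, "wt", encoding="utf-8") as f:
    f.write("line 1\nline 2\n")

# Both files are read with exactly the same code.
for path in (plain, zipped):
    with open_text(path) as f:
        lines = [line.rstrip("\n") for line in f]
    print(os.path.basename(path), lines)
```

The price is that the file is opened twice, once to peek at the magic number and once to read it, which is negligible next to the cost of reading a large data file.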
