First Python Project – Extract .zip Files


Digging through some old projects, I found my first Python project. This is a fairly large leap in time compared to my previous post, but I wanted to get some awesome scripting in here. Cue my first script: automated .zip extraction.

My actual first scripts were of course the basic “Hello World” and moving a file from here to there, but this was the first useful project. The situation was that my shop at the time had a large number of files to pull from archives on a regular basis for analysis. All of these files were in compressed .zip format, and I needed to extract all of them, find specific files and move them to a desired work space. The reason this wasn’t a simple process of selecting all the .zips and choosing “Extract here…” was that the folder structure within the .zip files would have several small trees, and manually searching them was simply too time consuming for something that could be done in seconds. The other issue was finding only specific file types and moving them into a single work space to run analysis on later.

After a quick search it looked like most of this magic could be done with simple Python by importing the modules os, fnmatch, and zipfile. These modules let us work with basic operating system functions (os), match filenames against wildcard patterns (fnmatch), and read compressed archives (zipfile).
I personally use as many variables as I can and put them at the top of my script when possible. This gives me an easy way to find and update them when the script is reused. As far as I know, most IDEs do not let a basic script pause to wait for user input, so I have to edit the code and then run it. As such I start out my project with:
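A minimal sketch of what that opening block might look like. The three variable names are the ones discussed in this post; the folder paths and the .shp extension are placeholder assumptions for illustration, not the real values.

```python
import zipfile, fnmatch, os

# Folder to search for .zip archives (placeholder path, assumed for illustration)
SourceDirectory = r"C:\Data\Downloads"

# File type to collect after extraction (placeholder value, assumed;
# written with its leading dot so it can be appended to a "*" wildcard)
FileExtension = ".shp"

# Folder the matching files end up in (placeholder path, assumed)
DestinationDirectory = r"C:\Data\Workspace"
```

Keeping these at the top means an analyst only has to touch three lines before running the script.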

I imported all of my necessary modules and added the variables I needed. There are three: an input folder that is easy to change, since I can’t assume every analyst will have the same folder; the file type we are specifically looking for; and an output folder, since each analyst needs to choose the output location for their own work.

Of interest here is the use of r before the SourceDirectory and DestinationDirectory strings. This makes them “raw string literals,” where backslashes are treated as literal backslashes. Without the r, certain folder names would be altered by escape codes. As an example, \n means a new line, so a path such as C:\new\folder would have its “\n” turned into a line break and would no longer point at the intended folder. A workaround would be to use a backslash to escape every backslash, so C:\new\folder would be read correctly if typed as C:\\new\\folder. This isn’t anything too crazy, but I would ideally expect my analysts to simply copy and paste into the variables without any changes. It might seem possible to use str.replace() instead, but this gets tricky since I need to worry about escaping my backslashes there too. Something like SourceDirectory.replace('\','\\') will fail because the backslashes need to be escaped. SourceDirectory.replace('\\','\\\\') is valid code, but by the time it runs Python has already interpreted the escape codes in the original string, so it cannot repair the path, and it is clearly more difficult than using raw string literals.
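To see the difference, here is a small sketch comparing the three ways of writing the same path. The variable names plain, escaped, and raw are just for this illustration.

```python
plain = "C:\new\folder"      # \n becomes a newline, \f becomes a form feed
escaped = "C:\\new\\folder"  # every backslash escaped by hand
raw = r"C:\new\folder"       # raw string literal: backslashes kept as typed

print(len(plain), len(raw))  # 11 13
print(escaped == raw)        # True
print("\n" in plain)         # True: the path now contains a line break
```

The plain version is two characters shorter because each backslash-letter pair collapsed into a single control character, which is exactly the damage the r prefix avoids.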

Now for some magic. Using a couple of for loops we can crawl an entire workspace and list everything found, then extract every file with a given extension, namely .zip files.

The next piece looked like this:
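A minimal sketch of such a loop, assuming Python 3. The function name extract_all_zips and its parameter are hypothetical, added here so the loop can be reused; the original was most likely a flat script driven by the SourceDirectory variable. Each archive is extracted into a folder named after the archive itself.

```python
import fnmatch
import os
import zipfile

def extract_all_zips(source_directory):
    """Extract every .zip under source_directory into a folder named after it."""
    extracted = []
    # os.walk visits source_directory and every folder beneath it
    for root, dirs, files in os.walk(source_directory):
        # Keep only the names that end in .zip
        for filename in fnmatch.filter(files, "*.zip"):
            zip_path = os.path.join(root, filename)
            print("Extracting " + zip_path)
            # Drop the extension to get the target folder, next to the archive
            target = os.path.join(root, os.path.splitext(filename)[0])
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(target)
            extracted.append(target)
    return extracted
```

Note that folders created during the walk are not revisited, so a .zip nested inside another .zip would need a second pass.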

Here we use os.walk to crawl a workspace and list out information about its directories and files. os.walk is recursive, so it searches entire directory trees and I don’t need to worry as much about any analyst’s personal download structure. Next, for each list of files it finds, fnmatch filters for the zip files. This is done by using the asterisk (*) as a wildcard to signify any file with that extension.

In the first line, the variable root holds the path of the folder currently being searched, and in the second line the filename variable holds the name of each .zip file found. I can use os.path.join to combine these two variables into a complete path to each .zip file. The next line uses that path simply to print a message indicating what the script is currently working on. I’m a huge fan of messages indicating progress. The combined path is more useful in the following line with zipfile, where I use it to run extractall. By default this extracts all contents into the current working directory, so the next step is to choose the folder to extract into.

We ideally do not want to simply extract with the default folder structure, as similarly named folders may allow files to be overwritten. We need unique folder names, ideally based on the original archive names so they stay in order and make sense in case they are needed. The filename variable won’t work here as it includes the .zip extension. However, another useful tool is os.path.splitext. This breaks a filename into a root (the actual file name) and an ext (the extension). In this case, I use [0] to indicate I only need the first output, the root or file name. If I used [1] it would grab the second output, the extension. With the extension removed from the extraction path, Python will now recursively find all .zip files and extract each one to an identically named folder in the same directory. Excellent start!
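The two helper calls are easy to try on their own. This is a small sketch using made-up file names (roads.zip, notes.txt, parcels.zip), not files from the original project.

```python
import fnmatch
import os

files = ["roads.zip", "notes.txt", "parcels.zip"]

# The * wildcard matches any name that ends in .zip
zips = fnmatch.filter(files, "*.zip")
print(zips)  # ['roads.zip', 'parcels.zip']

# splitext returns a (root, ext) pair; [0] is the root, [1] the extension
name, ext = os.path.splitext("roads.zip")
print(name)  # roads
print(ext)   # .zip
```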

Before I move on to the next piece I wanted to add a message to confirm the current progress. I create a simple one-time print message with a small delay before proceeding. To get the delay, I need to add time to the imported modules. So now my first line is: import zipfile,fnmatch,os,time
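A sketch of that progress message, assuming a delay of two seconds (the post does not state the actual value). The helper name announce and the PauseSeconds variable are hypothetical.

```python
import time

# Length of the pause in seconds (assumed value, not from the original script)
PauseSeconds = 2

def announce(message, delay=PauseSeconds):
    """Print a one-time progress message, then wait before continuing."""
    print(message)
    time.sleep(delay)
```

Calling announce("Extraction complete. Collecting files...") between the two loops gives the analyst a moment to read the message before the next batch of output scrolls past.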

Next we need to do some consolidation. We technically have everything out in the open, so we need to go on another search, but this time for our desired file type instead of .zip files. Then instead of extracting we simply need to copy/cut and paste to the specified directory. Due to these similarities this portion looks fairly close to the second piece.

The for loops are essentially the same. The only difference here is that instead of the hard-coded *.zip, I’ve thrown in the FileExtension variable. Note that I had to put it outside the quotes so Python would not actually look for a file such as data.FileExtension. The plus sign appends the variable’s value to the wildcard. Then comes another similar print message for progress and the final step, moving every file found to the output directory. For this it appeared best to use another module, shutil. My first line is now: import zipfile,fnmatch,os,time,shutil. The same os.path.join is used, but this time, instead of joining root to the filename to find the original, I join the destination variable to the filename. The filename stays the same and the file simply moves to a new folder. Very quick, very easy to read (I think) and it seems like a relatively small amount of code.
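A minimal sketch of this consolidation step, assuming Python 3 and assuming the extension is written with its leading dot (for example ".shp") so that "*" + extension forms the pattern. The function name collect_files and its parameters are hypothetical stand-ins for the script's top-level variables.

```python
import fnmatch
import os
import shutil

def collect_files(source_directory, file_extension, destination_directory):
    """Move every file matching file_extension into destination_directory."""
    moved = []
    for root, dirs, files in os.walk(source_directory):
        # The variable sits outside the quotes and is appended to the wildcard
        for filename in fnmatch.filter(files, "*" + file_extension):
            source_path = os.path.join(root, filename)
            print("Moving " + source_path)
            # Same filename, new folder
            destination_path = os.path.join(destination_directory, filename)
            shutil.move(source_path, destination_path)
            moved.append(destination_path)
    return moved
```

One thing to watch for: two files with the same name in different archives will collide in the single output folder, so a real tool would probably want a check for that.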

There are a large number of things that could be added to this to clean it up. Perhaps in a later post I will take this and add a bit of arcpy code and a few checks/options in order to turn it into a tool. For now, the script in its entirety can be downloaded here: https://s3.amazonaws.com/alanlclack.com/blog/downloads/Extract.py


Alan Clack

Alan Clack © 2016