Google Sitemap Generator MemoryError Workaround Fix
February 6th, 2008 by admin
If you have ever used the Google Sitemap Generator Python script on a busy Linux/Unix server where the access logs grow huge, chances are you have run into this error:
MemoryError
Basically, the script loads the entire log file into memory before parsing it, and depending on the size of your log file and the amount of memory on your machine, that can simply be too much. So what do you do now?
I did what I assume you are doing right now and started Googling for answers. Other than an old fix for an older version (which didn’t work for the latest), nothing I found fixed the issue.
I sat there wondering how I was going to shrink the log file so the script could handle it, especially when I wasn’t there to monitor it (i.e. when it runs as a cron job). I thought to myself, “I wish I could just split it into smaller files”… Finally, the light came on: Duh, isn’t this one of the reasons the split command was created?
So I wrote this little shell script that I could call via cron and placed it in a non-web-accessible directory named sitemap:
cd /home/username/sitemap
rm -f log-*
rm -f access_log
cp /var/log/access_log ./
split -l 50000 access_log log-
/usr/local/bin/python2.4 /home/username/sitemap/sitemap_gen.py --config=/home/username/sitemap/config.xml
This worked like a charm. So what is happening here?
When the script is executed, it deletes all files in the sitemap directory that start with log- and the file access_log, copies the access_log file from the /var/log directory, splits the file into smaller files of 50,000 lines each, and then executes the sitemap generator.
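If you want the same steps with a bit more safety, here is a minimal sketch that wraps them in a function. The name refresh_chunks and its arguments are my own invention for illustration, not part of the sitemap generator, and it assumes you pass absolute paths. The main difference is that it stops if the cd or cp fails, so rm never runs in the wrong directory and split never runs on a stale copy.

```shell
#!/bin/sh
# Hypothetical helper: refresh_chunks SRC_LOG WORK_DIR [LINES_PER_FILE]
# Rebuilds WORK_DIR/log-* from a fresh copy of SRC_LOG.
# Both paths are assumed to be absolute.
# The body runs in a subshell, so the caller's working directory is untouched.
refresh_chunks() (
    src_log=$1
    work_dir=$2
    lines=${3:-50000}                    # default matches the script above

    mkdir -p "$work_dir" || exit 1
    cd "$work_dir" || exit 1             # bail out rather than rm somewhere else

    rm -f log-* access_log               # clear the previous run's files
    cp "$src_log" access_log || exit 1   # work on a copy, not the live log
    split -l "$lines" access_log log-    # writes log-aa, log-ab, ...
)
```

With that in place, the cron script could shrink to calling refresh_chunks /var/log/access_log /home/username/sitemap 50000 and then running the generator only if it returned success.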
The sitemap generator allows you to use a wildcard for access logs, so I just told it that my logs start with log- in my config.xml file:
<accesslog path="/home/username/sitemap/log-*" />
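Before trusting the wildcard, it may be worth checking that the pattern really matches the split files and that no lines went missing along the way. This is only a rough sketch: check_chunks is a made-up name, and it assumes the copied access_log sits in the same directory as the log-* files, as it does in my setup.

```shell
#!/bin/sh
# Hypothetical sanity check: check_chunks DIR
# Succeeds only if DIR/log-* exists and the pieces together contain
# exactly as many lines as DIR/access_log.
check_chunks() {
    dir=$1

    # If nothing matches, the pattern stays literal and the -e test fails.
    set -- "$dir"/log-*
    if [ ! -e "$1" ]; then
        echo "no files match $dir/log-*" >&2
        return 1
    fi

    chunk_lines=$(cat "$dir"/log-* | wc -l)
    orig_lines=$(wc -l < "$dir/access_log")

    echo "$# files, $((chunk_lines)) lines in total"
    [ "$chunk_lines" -eq "$orig_lines" ]
}
```

Running check_chunks /home/username/sitemap right after the split step gives a quick yes or no through its exit status.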
That was it! It can now handle a log file of any size. If you still get a MemoryError, try reducing the size of the split files to fewer than 50,000 lines each.
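If you would rather not tune the number by hand, something like the following could do it for you. Treat it as a sketch under assumptions: retry_split and MIN_LINES are names I made up, the lower limit of 1000 lines is arbitrary, and it only works if the generator command exits with a non-zero status when it runs out of memory, which is worth confirming on your version before relying on it.

```shell
#!/bin/sh
# Hypothetical retry loop: retry_split LOG PREFIX START_LINES COMMAND [ARGS...]
# Splits LOG into PREFIX* files and runs COMMAND; on failure, halves the
# number of lines per file and tries again.
# Assumption: COMMAND returns non-zero when it hits MemoryError.
retry_split() {
    log=$1
    prefix=$2
    lines=$3
    shift 3                              # what remains is the command to run
    min_lines=${MIN_LINES:-1000}         # arbitrary floor so the loop ends

    while [ "$lines" -ge "$min_lines" ]; do
        rm -f "$prefix"*                 # drop the files from the last attempt
        split -l "$lines" "$log" "$prefix" || return 1
        if "$@"; then
            echo "succeeded with $lines lines per file"
            return 0
        fi
        lines=$((lines / 2))             # try again with smaller files
    done
    return 1
}
```

In my setup the call would be retry_split /home/username/sitemap/access_log /home/username/sitemap/log- 50000 /usr/local/bin/python2.4 /home/username/sitemap/sitemap_gen.py --config=/home/username/sitemap/config.xml, run after the log has been copied into place.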
Happy hacking!
This entry was posted on Wednesday, February 6th, 2008 at 3:27 pm and is filed under Uncategorized.
April 9th, 2008 at 5:32 am
It worked!
Thanks. A little Googling and I found your answer to this annoyance. I would think they should really build this into their next version as a feature.