Unfortunately, changes to Yahoo! Groups have rendered this script inoperable since August 2013. We have no current plans to update it to work with the new version, especially because there no longer seems to be any way to access the raw email message data. Sorry.
yahoo2mbox is a small Perl script which retrieves all messages from a mailing list archive at Yahoo! Groups and stores them in a local file in MBOX format, which is recognized by all Unix mail readers and a good many others.
If you don't know what Yahoo! Groups are, you probably don't need this program. But if, like me, you want to search through an existing archive using your favourite MUA instead of the Yahoo interface, you might like it.
Notable features include support for localized and age-restricted Yahoo groups. Unfortunately, automatic address unmangling no longer works (as of December 2003, and probably earlier) because of a change in Yahoo's address presentation algorithm.
The latest version is 0.25. Note that version 0.23 or later has been required since June 2006, as earlier versions no longer work due to changes to the Yahoo Groups pages.
This program is in the public domain, which basically means that you can do whatever you want with it.
You need Perl 5.004 (it might work with earlier versions, but this is the earliest one I tested it with) and a number of modules, all of which can be retrieved from CPAN, including (but possibly not limited to) HTTP::Cookies, LWP::UserAgent and HTML::Parser.
The program has only been tested under Linux with Perl 5.004, 5.005, 5.6, 5.8 and 5.10 and under Windows 2000/XP/2003/7 with ActivePerl builds from 631 to 810, but it should work on the other platforms supported by Perl as well. In particular, it is neither Unix- nor Windows-specific.
You can get the version 0.25 (released on 2011-03-09) of the script here:
Windows users: if you have never used Perl before, you need to download Perl from, for example, ActiveState and install it. To run this script you should enter perl yahoo2mbox.pl in a command line ("DOS") window and enter all the other parameters afterwards.
Simply run the script, giving it the name of the group to download the messages from. If the group archives are restricted to members only, you will need to use the --user=member_name option and, optionally, --pass=password. Strictly speaking only the first one is required: you will be prompted for the password if you haven't specified it.
By default the output goes to a local file with the same name as the group, but this can be changed using the -o output_file option. You can control the range of messages to be retrieved (all of them by default) using the --start and --end options. If the output file already exists, messages are appended to it unless the --noresume option is given. By default, resuming starts with the message with index equal to the number of messages already present in the file, but this is affected by the --start option: if you started downloading from message 100 and the process aborted after 10 messages, the next run would resume at message 11 without any special options, and at message 111, as is probably needed, only if you specify the same --start=100 option again.
Other useful options are --proxy=url, if you're behind a firewall (you may also include the user name and password, if your proxy needs them, using the http://user:password@host notation), and --cookies, if you have previously logged in to Yahoo using Netscape or yahoo2mbox (it avoids the need to specify the login name and password each time).
If you want to access your country-specific groups you should use the --country option. Please note that only a few countries are currently supported and your help is needed to make this option work for more of them!
The last noteworthy feature is the --x-yahoo option (enabled by default in recent versions; use --no-x-yahoo to turn it off), which tells the script to insert into every downloaded message an X-Yahoo-Message-Num header containing the ordinal number of the message in the group. This may be useful for synchronizing the local mailbox with the Yahoo archives, for example.
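As an illustration of how this header can help with synchronization, here is a minimal Python sketch (it is not part of yahoo2mbox) which reads an existing MBOX file and reports how many messages it holds and the highest X-Yahoo-Message-Num value found. The function name summarize_mbox and the file name in the usage note are assumptions made for the example; only the header name comes from the script.

```python
import mailbox


def summarize_mbox(path):
    """Return (message_count, highest_yahoo_number) for an MBOX file.

    highest_yahoo_number is None when no message carries a usable
    X-Yahoo-Message-Num header.
    """
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        count = 0
        highest = None
        for message in box:
            count += 1
            raw = message.get("X-Yahoo-Message-Num")
            if raw is None:
                continue  # message saved without the extra header
            try:
                number = int(str(raw).strip())
            except ValueError:
                continue  # ignore malformed header values
            if highest is None or number > highest:
                highest = number
        return count, highest
    finally:
        box.close()
```

For example, summarize_mbox("mygroup") on a hypothetical output file named mygroup returns both numbers, and the highest message number can help you choose a suitable --start value for the next run.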
The most common problem seems to be related to some kind of download limit put in place by Yahoo. Older versions of the script (before 0.13) were badly confused by the error page served by Yahoo after a certain number of bytes (apparently the limit is counted in bytes, not messages) had been downloaded. Newer versions should detect it automatically, give a corresponding error message and stop trying to download anything (what's the point of banging your head against the wall, anyhow).
The download limit disappears with time, but unfortunately I don't know how long you have to wait before it does. The only hint I can give is that there are two, apparently independent, download limits: one for anonymous users and another for registered ones. So you could try downloading the messages anonymously and, when you hit the limit, switch to using the username and password. Of course, this works only for groups with public archives.
Additionally, the limit is specific to the IP address, so if you are able to change your IP address (e.g. you have a dial-up connection) you could try doing this. On the other hand, if you have a direct and fast connection to the internet, using the --delay option could be helpful, as it seems to bypass at least some of the download limits.
Chris Gamlin has kindly contributed a DOS batch file which combines the advice above and seems to work around the download limits: at least, Chris was able to download 60000 messages from the archives of his group using it (even though it took 4 days). Here is Chris's explanation of how to use it:
The script uses 2 Yahoo usernames/passwords to share the load, so if you only have one username you'll need to register another. I just used another named profile within the same Yahoo user account: even if one gets locked out, the other seems to continue working.
Rename the attached file as yahooscript.bat in a new folder. In the same folder you'll also need the yahoo2mbox.pl file and a file called sleep.exe, which you can officially get in the Windows 2003 Resource Kit, but there's a version here that will also do the job: http://www.computerhope.com/dutil.htm
Once you have the batch file, sleep.exe and the yahoo2mbox.pl file in the same location, running the batch file will prompt you for the required information and recommend the delay / download settings I used, which seemed to work without hitting the download limits too often. If it does hit a limit and lock out, it will switch to the second account, which should give the first one time to unlock again. Once the second one locks out, it switches back to the first one again, and so on.
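The account-switching idea described above can be sketched in a few lines of Python. This is only an illustration of the strategy, not a translation of Chris's batch file: the names download_range, DownloadLimitReached and the fetch callable are all hypothetical, and fetch stands in for whatever actually retrieves one message (for instance, a yahoo2mbox run for a single index).

```python
import time


class DownloadLimitReached(Exception):
    """Raised by the (hypothetical) fetch callable when an account is locked out."""


def download_range(first, last, accounts, fetch, delay=0.0, max_switches=10):
    """Fetch messages first..last, rotating accounts when one hits the limit.

    fetch(account, index) must retrieve one message or raise
    DownloadLimitReached. Returns the index of the next message still to
    be fetched, i.e. last + 1 when everything was downloaded.
    """
    current = 0    # position of the active account in `accounts`
    switches = 0   # how many times we have changed account so far
    index = first
    while index <= last:
        try:
            fetch(accounts[current], index)
        except DownloadLimitReached:
            switches += 1
            if switches > max_switches:
                break  # every account keeps locking out: give up for now
            # Move to the next account and retry the same message.
            current = (current + 1) % len(accounts)
            continue
        index += 1
        if delay:
            time.sleep(delay)  # pause between messages to stay under the limit
    return index
```

The returned index plays the same role as a --start value: if the function gives up early, a later run can pick up from that message.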
- To Malcolm-Rannirl for implementing support for using the Netscape cookies file and more.
- To Dan Libby for the idea of the --resume option.
- To Per Bolmstedt for the old semi-manual address unmangling code.
- To Daniel Roethlisberger for country support code.
- To JHB for support of age-restricted Yahoo! groups.
- To Zainul M Charbiwala for implementing automatic address unmangling (unfortunately this doesn't work any longer but it was incredibly useful while it did).
- To Robin Lee Powell for bug reports.
- To Daniel Sutcliffe for various contributions.
If you think your name should be in this list and it is not, please contact me.
- Support Canadian groups with --country=ca.
- Fix title format for the French groups.
- Explicitly require at least Perl 5.6.
- Add --retry option to reget any missing messages (Max Baker).
- Added --debug command line option and save messages which the script failed to parse in files for later analysis (Max Baker).
- Handle indexes beyond the last message correctly (Michael Kielsky).
- Fix the bug in the last version which could omit the blank line between messages, resulting in an invalid MBOX file.
- Updated to work with the latest (as of June 2006) version of the Yahoo pages.
- Fixed a bug resulting in incorrect MBOX output because of unaccounted-for spaces in Yahoo web pages (thanks to Anthony Yen, David Silberstein, Daniel Sutcliffe).
- --x-yahoo option is now on by default, use --no-x-yahoo if you don't want the extra headers (thanks to Emmanuel Chantreau for suggestion).
- Added --next option (Brett D. Estrade).
- Further adaptations to new Yahoo Groups pages layout.
- Recognize another error message shown when the download limit is reached (Emmanuel Chantreau).
- Fixed a bug in the first message calculation introduced in 0.18.
- Save Yahoo post ids in the output file.