Why implement a new hash system?
As noted in the introduction, the eD2k system is much too vulnerable to survive in an hostile environment like the one that is slowly starting to take shape and gain momentum. A stronger integrity check system with finer granuality are necessary in order to both reduce corruption waste and improve resilience against more subtle attacks.
The immediate advantages would be:
- Scalable chunk sizes based on tree hash verifiability.
- Scalable integrity checking anywhere from 1KB to complete file.
- "Seek and Destroy" minimal waste corruption recovery algorithms.
- Very strong internal integrity (or so I think) to ward off all reasonable cryptanalysis forgery attacks. Even if the base hash becomes readily reversible, the structure should remain sufficiently secure to last until a stronger variant can be widely deployed.
As collateral benefits, MAFS's scalable chunk size capability makes the following possible:
- Variable size distributable framents for faster fragment seeding and to accomodate clients of any speed.
- Shorter queue times and faster queue turn-over rates due to smaller chunks.
- Less re-queueing request overhead due to more frequently active transfers.
- More sources to download from earlier due to the availability of chunk fragments from compatible clients.
- "Almost the same file" matching to acquire sources for file repair and download. (work around minor corruption like changed ID3 tags.)
There are a few things MAFS cannot do (yet):
- Map eD2k hashes to MAFS top hashes since the two systems are completely different.
- Solve the mystery of securely matching MAFS top hashes to eD2k with 100% certainty. This may be only temporary - until search engines start using the new hash system. In the meantime, it could constitute a new attack vector but would be far less trivial to exploit than the peppering attacks.
- The dishes, cleaning, cooking, etc.
I have limited knowledge of alternatives but here is how I think my implementation compares with the two I know:
9500KB chunks can only be verified against the chunk hash. When a chunk hash fails, the chunk is discarded and re-downloaded again and again until it hashes correctly. Assuming a malicious client on dial-up is doing network peppering at 4KB/sec and 200B sessions, this client would be capable of corrupting 20 chunks per second, representing corruption rates of 190MB/second, 680GB/hour, 16TB/day. Imagine what a well-funded and connected distributed peppering farm could do... even a small distributed farm would have the potential of killing the eDonkey network on which eMule is based: 10Mbps, 1KB/session = 1000 sessions per second = 9.5GB corrupted per second, 34TB per hour, 820TB per day per 10Mbps peppering farm. Such farms could effectively cause the majority of clients to re-download one or more of those 9500KB chunks ad-infinitum. Basically, the eDonkey legacy system is very much doomed. The relevant improvements from my system here are:
- Variable chunk sizes to minimize waste on unrecoverable chunks. eDonkey is one of the rare P2P networks using unreasonable chunk sizes. Some patches have been implemented to cut down chunk sizes (crums) but the overall implementation is crummy - or so I heard.
- Intelligent Corruption Handling further reduces the amount that needs to be re-transferred to correct accidental and malicious corruption.
- Improved hash system can detect sub-chunk forgeries that would otherwise be undetectable.
Although already a huge improvement over the eDonkey system, it assumes that Tiger hashes are and will remain unbreakable for the forseeable future. The tree is built from simple direct data hashes and each node is only aware of its two children nodes, meaning that a successful forgery (same hash, 'stealth corruption') of one data block or sub-tree top hash would be undetectable. The Bitzi SHA1 file hash would fail but the tree would be unable to locate the corrupted part. While the basic idea of a tree is neat, a simple tree is no stronger than the hashes used so if Tiger later turns out to be not as secure as it was thought, a simple Tiger Tree could become easy prey. To work around this possibility, my implementation goes a little extra distance:
- Redundant hashing to prevent compromised tree nodes and data blocks from going undetected.
- Extra top tree hash to secure the top tree to guarantee that the top tree is authentic. Since the top tree expands downwards to provide lower level hash authentication, an authentic top needs to be assembled. At this point, you either got an authentic file or an authentic fake. Both start with an 'f', take your pick!
I am sticking with the general tree structure because it is convenient, flexible and efficient. With the extra work I added to the basic tree, the structure as a whole should become much stronger than the hashes it uses, meaning the system should be able to survive even if its base hash algorithm becomes readily compromisable. Chances are that some people can already guess how I plan to do this since there is at least one obvious and simple way to do it and that is what I am working on. I probably could have reused most of the code from a Merkle implementation but since I do not have such code, I will simply keep on whipping up my own since it is already nearing completiion.
I expect to start privately integrating and testing this new hash system with eMule in late 2003 at best. Right now, the most realistic timeline (after seeing how slowly things have been going lately) looks as follows:
- Implementation and API disclosure, already started back in September.
- Closed-source custom demo app for people to play with while I start modding the stock mule, late 2003.
- Closed-source limited pre-alpha modded client testing, Q1 2004.
- Semi-open source broader alpha modded client testing, late Q1 2004.
- Full-release open beta-testing, Q2 2004.
- Definitive implementation, maybe late H1 2004.
It seems my previous Xmas 2003 'most optimistic' estimate was quite far off! (or it may just be that I had an exceptionally crappy and distracting last month that threw my rythm completely off-track, in which case an Xmas alpha might still be possible if I get back on-track with full steam soon.)
This section addresses issues related to how the source code is laid out and other source-code related things.
When Doxygen processes source files, it strips them from the the comment blocks used to generate the documentation you are seeing in these pages. It also optionally strips the standard C/C++ comments to leave only the code.
This is the FAQ section where I dump questions that have too few related issues to earn their own separate FAQs section.
Well, I have been using eMule for most of the last two or so years. (I think the first version I used was 0.23a)
It did what I needed it to do often better than eDonkey. Because it is open-source, it also means I can change it to make it behave more like I want it to. Since it does most of what I need, is generally nice, open-source, etc. I can mod the mule I know or mod a different animal but I am probably too lazy to get involved in new breeds so I decided that it was not necessary to change client/P2P network and started modding the mule. My first (and only so far) significant change being Moonlight's Upload/Download Regulator (MUDR).
While I am stalled working on MAFS, I do random minor projects like Moonlight's Sparse File Support (MSFS) or other micro-mods (one or two GUI controls with not that many more source code lines) to make the mule a little more friendly or less wasteful. These tweaks (once there will be enough of them) will become the Moonlight's Mod.
Most of what you are seeing is automatically generated by Doxygen (www.doxygen.org) from plain source code comments and a few simple control tags. For more information about Doxygen, go to http://www.doxygen.org.
Another good thing about doing it this way is that by documenting as you go, some things that can make you pause and do nothing come together naturally when you think about them while writing documentation and have to explain why one thing was done in a specific manner. Sometimes, this makes you realize that some functions were completely unnecessary and come up with a much better implementation.
Finally, documenting along the way also gives some indication of progress and makes it easier to jump back in the code after not working with it for a while. There are many more advantages but I think this is enough.
One setback of trying to write good documentation is that it takes roughly as much time to write as the code it covers.
Hits since December 5, 2003:
Generated on Fri Dec 5 04:42:55 2003 for Moonlight's eMule Tweaks Documentation by
1.3.4