Ever wanted to get more information about known (open source) software?
Ever wondered about how (and if) people reuse their code?
Ever wondered how much duplicated code resides in a given piece of software?
We do.
We are a small team of two obsessed with software metrics, namely software clones.
We enjoy learning about how many clones software projects have.
We enjoy it so much that we are building a startup around it, based on one hypothesis: can software clone information help us programmers be better programmers? Can it save us time? We believe so.
We also believe that clones are not an inherently bad practice.
But do software systems really have duplicated code?
Is it really a problem?
This will be the first in a series of blog posts aimed at answering these two questions.
In this post, we present a mini software clone study of Doom 3, the awesome id Software game.
But first, a little background (if you are a busy reader, feel free to skip it and return here later to look up the terms):
These definitions will help you understand what's to come.
A Software Clone is a duplication of source code where code fragments are copied from one location and pasted to other locations. This can happen intentionally or accidentally.
A Clone Group is a set of two or more code fragments that are all clones of one another. So if fragment A is cloned to B and B is cloned to C, (A,B,C) is a clone group of 3 code fragments.
The Clone Detection Strategy we used was strict copy-paste. This means that we consider two code fragments duplicated only if they are exactly the same (with no variations in strings, identifier names, chars, types, etc.).
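To make the strict copy-paste idea concrete, here is a minimal sketch in Python. It is not the tool used for this study: the function name `find_exact_clones`, the fixed-size sliding window, and the choice to ignore leading and trailing whitespace are all assumptions made for illustration.

```python
from collections import defaultdict

def find_exact_clones(files, window=2):
    """Group identical runs of `window` consecutive lines across files.

    `files` maps a (hypothetical) file path to its source text.
    Returns {tuple_of_lines: [(path, first_line_number), ...]} for every
    run of lines that occurs in more than one place.
    """
    seen = defaultdict(list)
    for path, text in files.items():
        # Assumption: indentation differences are ignored, everything else must match.
        lines = [line.strip() for line in text.splitlines()]
        for start in range(len(lines) - window + 1):
            chunk = tuple(lines[start:start + window])
            if all(chunk):  # skip windows that contain a blank line
                seen[chunk].append((path, start + 1))
    # A clone group needs at least two fragments.
    return {chunk: locs for chunk, locs in seen.items() if len(locs) > 1}

# Toy input: the first lines differ only by an identifier name,
# so under the strict strategy they are NOT clones.
files = {
    "a.cpp": "int x = 0;\nx += 1;\nreturn x;\n",
    "b.cpp": "int y = 0;\nx += 1;\nreturn x;\n",
}
groups = find_exact_clones(files, window=2)
```

Here `groups` holds a single clone group with two fragments, the lines `x += 1;` and `return x;` in both files. A real detector would also extend matches to their maximal length instead of using a fixed window.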
So, onto Doom 3!
Doom 3 was released in 2004 and was an awesome game that sold more than 3.5 million copies.
Doom 3 was made open source on 22 November. This gave us a chance to study clone metrics inside a cool piece of software.
So, without further ado, here are some statistics related to the project:
The total number of clone groups represents a measure of how many duplicated fragment relationships there are. It is an abstract number that is only useful when compared with the number of clone groups present in other projects.
The largest clone group in terms of number of lines, i.e., the largest copied fragment, has 872 lines. This means that there are code fragments (two to be precise) which have 872 lines in common (duplicated).
This cloned fragment can be viewed here.
The largest clone group in terms of number of code fragments, i.e., the most populous clone group, has 46 code fragments, which all have 2 code lines in common. Two code lines in common is a very weak type of clone and probably not intentional. Such clones could be removed from this analysis, but we keep them in this study for statistical purposes.
This clone group can also be viewed here.
The percentage of cloned code, i.e., the total number of lines in all the code fragments detected as clones divided by the total number of lines in Doom 3, is 24.59%. That is a high value. It is due to the high number of small clone groups (2 to 3 code fragments) and to the sheer number of clones overall.
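The percentage above is a simple ratio, sketched below in Python. The function name `clone_coverage` and the fragment format are hypothetical, and counting each line only once when fragments overlap is an assumption, not necessarily how the figure in this post was computed.

```python
def clone_coverage(fragments, total_lines):
    """Percentage of lines that belong to at least one cloned fragment.

    `fragments` is a list of (path, first_line, last_line) tuples with
    inclusive, 1-based line numbers; `total_lines` is the project size.
    """
    cloned = set()
    for path, start, end in fragments:
        # Assumption: a line covered by several fragments is counted once.
        cloned.update((path, n) for n in range(start, end + 1))
    return 100.0 * len(cloned) / total_lines

# Made-up fragments in a made-up 100-line project.
fragments = [("a.cpp", 1, 10), ("b.cpp", 5, 14), ("a.cpp", 8, 12)]
coverage = clone_coverage(fragments, total_lines=100)
```

In this toy input, `a.cpp` contributes lines 1 to 12 and `b.cpp` contributes lines 5 to 14, so 22 of the 100 lines are cloned.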
So Doom 3 has, indeed, a lot of software clones.
However, this is not necessarily bad. Software clones have a reputation as bad coding practice, but sometimes the cost of refactoring outweighs the benefit. Cloning can also reduce development time.
(Good article from John Carmack)
A more detailed look at the clones can be observed in this graph:
Each bar reads as such: “There are 702 clone groups where the copied code fragment has 2 lines”.
The graph uses a logarithmic scale to better present the results.
Also, the bars that appear to be missing represent a single clone group: on a logarithmic scale a count of 1 has zero height.
We can see that the number of clone groups decreases as the number of cloned lines grows.
We can also see that there is a range of large clone fragments (from 101 to 872 lines) with just one clone group each. These are the more dangerous clones, where the copy-pasting is more blatant.
We discovered whole files copied with only subtle differences, where the maintenance cost would be high.
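For readers who want to build a similar chart, the sketch below shows one way to count clone groups per fragment size and compute the bar heights on a base-10 logarithmic scale. The function name `size_histogram` and the input list are invented for illustration and are not the Doom 3 numbers.

```python
import math
from collections import Counter

def size_histogram(group_line_counts):
    """Map fragment size in lines to (number of groups, log10 bar height)."""
    counts = Counter(group_line_counts)
    return {size: (n, math.log10(n)) for size, n in sorted(counts.items())}

# Made-up data: one entry per clone group, giving its cloned line count.
hist = size_histogram([2, 2, 2, 3, 3, 872])
```

The entry for 872 lines has a count of 1 and a bar height of 0.0, which is why sizes with a single clone group show no visible bar.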
(left click to visit a folder; right click to return to the parent folder)
Now, to better visualize the cloning information, we also present a treemap of the folders of the project, where larger areas mean a greater number of clones.
The colors are just eye candy.
This way we can match the clones found in this process to the folders they live in.
By visiting this map we can see that there are folders that contain a great number of clones: d3xp/physics, d3xp/ai, game/physics, etc.
A fun fact: the sys/win32 folder (where the Windows-specific code resides) has many more clones than sys/linux (which has roughly 43 fewer).
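The data behind such a treemap is a per-folder tally. The sketch below is one possible way to build it, assuming each cloned fragment is identified by the relative path of its file. The function name `clones_per_folder` and the file names are hypothetical; only the folder names come from the project.

```python
from collections import Counter
from pathlib import PurePosixPath

def clones_per_folder(fragment_paths):
    """Count cloned fragments per folder, including every ancestor folder."""
    totals = Counter()
    for path in fragment_paths:
        for folder in PurePosixPath(path).parents:
            if str(folder) != ".":  # skip the implicit project root
                totals[str(folder)] += 1
    return totals

# Made-up fragment locations.
paths = [
    "d3xp/physics/a.cpp",
    "d3xp/physics/b.cpp",
    "d3xp/ai/c.cpp",
    "sys/win32/d.cpp",
]
totals = clones_per_folder(paths)
```

Counting each fragment toward all of its ancestor folders means a parent folder's area is the sum of its children, which is what lets a reader drill down from `d3xp` into `d3xp/physics`.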
We showed some clone detection data for the famous Doom 3 game.
We have shown that a well-established piece of software contains almost 25% cloned code (bearing in mind that some of that duplication is unintentional).
Feel free to sign up for our beta testing, where we can offer these visualizations for your project, along with other clone analysis features, including inconsistent clone management.
Let us manage your clones, so you don’t have to.
