Git Mailing List Archive mirror
* [PATCH/RFC] statplot: a tool for mining repository statistics.
@ 2007-03-01  7:29 Shawn O. Pearce
  2007-03-01  9:58 ` Alex Riesen
  0 siblings, 1 reply; 6+ messages in thread
From: Shawn O. Pearce @ 2007-03-01  7:29 UTC
  To: Junio C Hamano; +Cc: git

This is a simple program to dump statistics about number of objects
per day, and how much disk space is used per day.  It can be used
to generate data files suitable for plotting repository growth
over time.

This program avoids displaying information about tags, but it does
cover the commits referenced by those tags.  Since annotated tags
are a very small portion of most projects I don't consider this to
be a major issue at this time.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---

 This is on top of Nico's cleanups that are currently in next,
 as I'm using the newer read_sha1_file interface.  Otherwise it
 has no dependencies on stuff in next.

 I am normally very anti-self-linking, but I have posted graphs from
 git.git and linux-2.6.git on my website if people are interested,
 so I'm self-linking:

   http://www.spearce.org/2007/03/git-and-linux-repository-growth.html

 I figured a link in email would be better than an attachment.  ;-)

 This is a toy, but it's more than good enough for production, so uh,
 here it is.  Should it be included in git.git?  Don't know....

 .gitignore                     |    1 +
 Documentation/git-statplot.txt |   63 ++++++++
 Makefile                       |    1 +
 builtin-statplot.c             |  318 ++++++++++++++++++++++++++++++++++++++++
 builtin.h                      |    1 +
 git.c                          |    1 +
 6 files changed, 385 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/git-statplot.txt
 create mode 100644 builtin-statplot.c

diff --git a/.gitignore b/.gitignore
index 0eaba0a..4c73efb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -119,6 +119,7 @@ git-ssh-fetch
 git-ssh-pull
 git-ssh-push
 git-ssh-upload
+git-statplot
 git-status
 git-stripspace
 git-svn
diff --git a/Documentation/git-statplot.txt b/Documentation/git-statplot.txt
new file mode 100644
index 0000000..77dd0d4
--- /dev/null
+++ b/Documentation/git-statplot.txt
@@ -0,0 +1,63 @@
+git-statplot(1)
+===============
+
+NAME
+----
+git-statplot - Generate repository statistics for plotting
+
+
+SYNOPSIS
+--------
+'git-statplot' [<commit-ish>*] >repo.dat
+
+DESCRIPTION
+-----------
+Scans all current packs and all reachable objects (optionally
+starting from <commit-ish>* rather than all refs) and determines the
+earliest date that each object was referenced by a commit.
+
+For each date, the number of new commits, number of new objects,
+and total bytes used within the packfile are reported to standard
+output.  The output format is suitable for plotting with Gnuplot,
+or any number of other popular plotting and analysis tools.
+
+A sample output follows, taken from the early days of git's own
+history:
+
+....
+	# Total days:    2
+	# Total commits: 5
+	# Date range: ["2005-04-06":"2005-04-07"]
+	# <date> <new_commits> <new_trees> <new_blobs> <new_objects> <new_bytes>
+	2005-04-06 2 2 19 23 8219
+	2005-04-07 3 3 11 17 4742
+....
+
+Here we see that during Git's first day developers created 23 objects
+(2 commits, 2 trees, 19 blobs), and these objects currently occupy
+8,219 bytes of disk space.
+
+During the second day an additional 17 objects were created, and these
+objects currently occupy 4,742 bytes of disk space.
+
+Notes
+-----
+This program only examines packed objects.  If unpacked objects
+are reachable from the input arguments these will not be counted in
+the output.  Running only on a fully packed repository is suggested.
+
+Commit counts, object counts and bytes can vary between runs
+for several reasons.  One cause of this change can be merging of
+long-lived branches. The merged commits will count on the day they
+were committed, not on the day they arrived in this repository.
+Another cause can be from repacking with the -f option, as deltas
+may be recomputed and recompressed, possibly altering the amount
+of space used by an object.
+
+Author
+------
+Written by Shawn O. Pearce <spearce@spearce.org>
+
+GIT
+---
+Part of the gitlink:git[7] suite
diff --git a/Makefile b/Makefile
index e4bb865..83da7fd 100644
--- a/Makefile
+++ b/Makefile
@@ -317,6 +317,7 @@ BUILTIN_OBJS = \
 	builtin-runstatus.o \
 	builtin-shortlog.o \
 	builtin-show-branch.o \
+	builtin-statplot.o \
 	builtin-stripspace.o \
 	builtin-symbolic-ref.o \
 	builtin-tar-tree.o \
diff --git a/builtin-statplot.c b/builtin-statplot.c
new file mode 100644
index 0000000..7f5323f
--- /dev/null
+++ b/builtin-statplot.c
@@ -0,0 +1,318 @@
+/*
+ * Computes the earliest commit that can reach each object, then
+ * summarizes this for each day found within the commit history.
+ *
+ * Produces a datafile to stdout suitable for Gnuplot.
+ */
+#include "cache.h"
+#include "commit.h"
+#include "refs.h"
+#include "pack.h"
+#include "tree-walk.h"
+
+#define SEEN      (1u<<0)
+
+struct packed_object {
+	unsigned char sha1[20];
+	unsigned int packed_bytes;
+	unsigned int type : 2;
+	unsigned int date : 30;
+};
+
+struct days_total {
+	unsigned long packed_bytes;
+	unsigned long new_objects[4];
+};
+
+static const unsigned long one_day = 24 * 60 * 60;
+static struct commit **all_commits;
+static struct packed_object *all_objs;
+static unsigned long nr_commits;
+static unsigned long max_commits;
+static unsigned long nr_objects;
+static unsigned long nr_unpacked;
+static unsigned int earliest_date;
+static unsigned int latest_date;
+
+static int sort_by_sha1(const void *a_, const void *b_)
+{
+	const struct packed_object *a = a_;
+	const struct packed_object *b = b_;
+	return hashcmp(a->sha1, b->sha1);
+}
+
+static int sort_by_offset(const void *a_, const void *b_)
+{
+	const struct packed_object *a = a_;
+	const struct packed_object *b = b_;
+	if (a->packed_bytes < b->packed_bytes)
+		return -1;
+	else if (a->packed_bytes > b->packed_bytes)
+		return 1;
+	return 0;
+}
+
+static void index_one_pack(struct packed_git *p)
+{
+	int i, num = num_packed_objects(p);
+	unsigned long s = nr_objects;
+	struct packed_object *o, *last;
+
+	if (num < 1)
+		return;
+
+	nr_objects += num;
+	all_objs = xrealloc(all_objs, nr_objects * sizeof(*all_objs));
+	for (i = 0, o = all_objs + s; i < num; i++, o++) {
+		nth_packed_object_sha1(p, i, o->sha1);
+		o->packed_bytes = find_pack_entry_one(o->sha1, p);
+		o->type = 0;
+		o->date = 0;
+	}
+
+	qsort(all_objs + s, num, sizeof(*all_objs), sort_by_offset);
+
+	last = all_objs + s;
+	for (i = 1, o = last + 1; i < num; i++, o++) {
+		last->packed_bytes = o->packed_bytes - last->packed_bytes;
+		last = o;
+	}
+	last->packed_bytes = p->pack_size - last->packed_bytes - 32;
+}
+
+static void scan_commit(struct commit *commit)
+{
+	if (!commit || commit->object.flags & SEEN)
+		return;
+
+	parse_commit(commit);
+	commit->object.flags |= SEEN;
+
+	if (nr_commits == max_commits) {
+		max_commits *= 2;
+		if (max_commits < 1024)
+			max_commits = 1024;
+		all_commits = xrealloc(all_commits,
+			sizeof(*all_commits) * max_commits);
+	}
+	all_commits[nr_commits++] = commit;
+}
+
+static int stat_handle_reflog_ent(unsigned char *osha1,
+	unsigned char *nsha1,
+	const char *email,
+	unsigned long timestamp,
+	int tz,
+	const char *message,
+	void *cb_data)
+{
+	scan_commit(lookup_commit_reference_gently(nsha1, 1));
+	return 0;
+}
+
+static int stat_handle_reflog(const char *logname,
+	const unsigned char *sha1,
+	int flag,
+	void *cb_data)
+{
+	for_each_reflog_ent(logname, stat_handle_reflog_ent, NULL);
+	return 0;
+}
+
+static int stat_handle_ref(const char *refname,
+	const unsigned char *sha1,
+	int flag,
+	void *cb_data)
+{
+	scan_commit(lookup_commit_reference_gently(sha1, 1));
+	return 0;
+}
+
+static int sort_by_cdate(const void *a_, const void *b_)
+{
+	const struct commit *a = *((const struct commit**)a_);
+	const struct commit *b = *((const struct commit**)b_);
+	if (a->date < b->date)
+		return -1;
+	else if (a->date > b->date)
+		return 1;
+	return 0;
+}
+
+static void list_all_commits()
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_commits; i++) {
+		struct commit_list *parents = all_commits[i]->parents;
+		while (parents) {
+			scan_commit(parents->item);
+			parents = parents->next;
+		}
+	}
+
+	qsort(all_commits, nr_commits, sizeof(*all_commits), sort_by_cdate);
+}
+
+static struct packed_object* find_one(const unsigned char *sha1)
+{
+	unsigned long hi = nr_objects;
+	unsigned long lo = 0;
+	while (lo < hi) {
+		unsigned long mi = (lo + hi) / 2;
+		int cmp = hashcmp(all_objs[mi].sha1, sha1);
+		if (!cmp)
+			return all_objs + mi;
+		if (cmp > 0)
+			hi = mi;
+		else
+			lo = mi + 1;
+	}
+	return NULL;
+}
+
+static int mark_one(
+	const unsigned char *sha1,
+	unsigned int date,
+	enum object_type type)
+{
+	struct packed_object *obj = find_one(sha1);
+	if (!obj) {
+		nr_unpacked++;
+		return 1;
+	}
+	if (obj->type)
+		return 1;
+	obj->date = date;
+	obj->type = type;
+	return 0;
+}
+
+static void mark_tree(const unsigned char *sha1, unsigned int date)
+{
+	struct tree_desc desc;
+	struct name_entry name_entry;
+	enum object_type type;
+	void *buf;
+
+	if (mark_one(sha1, date, OBJ_TREE))
+		return;
+
+	buf = read_sha1_file(sha1, &type, &desc.size);
+	if (!buf || type != OBJ_TREE)
+		die("missing/corrupt tree object %s", sha1_to_hex(sha1));
+	desc.buf = buf;
+	while (tree_entry(&desc, &name_entry)) {
+		if (S_ISDIR(name_entry.mode))
+			mark_tree(name_entry.sha1, date);
+		else
+			mark_one(name_entry.sha1, date, OBJ_BLOB);
+	}
+	free(buf);
+}
+
+static void compute_earliest_dates()
+{
+	unsigned long i;
+
+	if (!nr_commits)
+		die("No commits?");
+
+	for (i = 0; i < nr_commits; i++) {
+		struct commit *commit = all_commits[i];
+		unsigned int day = commit->date / one_day;
+		mark_one(commit->object.sha1, day, OBJ_COMMIT);
+		mark_tree(commit->tree->object.sha1, day);
+	}
+
+	earliest_date = all_commits[0]->date / one_day;
+	latest_date = all_commits[nr_commits - 1]->date / one_day;
+}
+
+static void print_yyyymmdd(unsigned int day)
+{
+	time_t date = day * one_day;
+	struct tm *when = gmtime(&date);
+	printf("%4.4i-%2.2i-%2.2i",
+		when->tm_year + 1900,
+		when->tm_mon + 1,
+		when->tm_mday);
+}
+
+static void compute_date_buckets()
+{
+	unsigned long days = latest_date - earliest_date + 1;
+	unsigned long i;
+	struct days_total *totals;
+	struct packed_object *o;
+
+	totals = xcalloc(days, sizeof(*totals));
+	for (i = 0, o = all_objs; i < nr_objects; i++, o++) {
+		if (o->date) {
+			struct days_total* t = &totals[o->date - earliest_date];
+			t->packed_bytes += o->packed_bytes;
+			t->new_objects[OBJ_NONE]++;
+			t->new_objects[o->type]++;
+		}
+	}
+
+	printf("# Total days:    %lu\n", days);
+	printf("# Total commits: %lu\n", nr_commits);
+	printf("# Date range: [\"");
+	print_yyyymmdd(earliest_date);
+	printf("\":\"");
+	print_yyyymmdd(latest_date);
+	printf("\"]\n");
+
+	printf("# <date> <new_commits> <new_trees> <new_blobs>"
+		" <new_objects> <new_bytes>\n");
+	for (i = 0; i < days; i++) {
+		print_yyyymmdd(earliest_date + i);
+		printf(" %lu %lu %lu %lu %lu\n",
+			totals[i].new_objects[OBJ_COMMIT],
+			totals[i].new_objects[OBJ_TREE],
+			totals[i].new_objects[OBJ_BLOB],
+			totals[i].new_objects[OBJ_NONE],
+			totals[i].packed_bytes);
+	}
+
+	free(totals);
+}
+
+int cmd_statplot(int argc, const char **argv, const char *prefix)
+{
+	int i;
+	struct packed_git *p;
+
+	for (i = 1; i < argc; i++) {
+		const char *arg = argv[i];
+		unsigned char head_sha1[20];
+
+		if (*arg == '-')
+			continue;
+		else if (!get_sha1(arg, head_sha1))
+			scan_commit(lookup_commit_reference(head_sha1));
+		else
+			error("invalid parameter: expected sha1, got '%s'", arg);
+	}
+
+	if (!nr_commits) {
+		for_each_ref(stat_handle_ref, NULL);
+		for_each_reflog(stat_handle_reflog, NULL);
+	}
+
+	prepare_packed_git();
+	for (p = packed_git; p; p = p->next)
+		index_one_pack(p);
+	qsort(all_objs, nr_objects, sizeof(*all_objs), sort_by_sha1);
+
+	list_all_commits();
+	compute_earliest_dates();
+	compute_date_buckets();
+
+	if (nr_unpacked)
+		warning("%lu unpacked objects were not included in output.",
+			nr_unpacked);
+
+	return 0;
+}
diff --git a/builtin.h b/builtin.h
index 48d68c8..8a6ee95 100644
--- a/builtin.h
+++ b/builtin.h
@@ -66,6 +66,7 @@ extern int cmd_runstatus(int argc, const char **argv, const char *prefix);
 extern int cmd_shortlog(int argc, const char **argv, const char *prefix);
 extern int cmd_show(int argc, const char **argv, const char *prefix);
 extern int cmd_show_branch(int argc, const char **argv, const char *prefix);
+extern int cmd_statplot(int argc, const char **argv, const char *prefix);
 extern int cmd_stripspace(int argc, const char **argv, const char *prefix);
 extern int cmd_symbolic_ref(int argc, const char **argv, const char *prefix);
 extern int cmd_tar_tree(int argc, const char **argv, const char *prefix);
diff --git a/git.c b/git.c
index 008e016..994f22e 100644
--- a/git.c
+++ b/git.c
@@ -278,6 +278,7 @@ static void handle_internal_command(int argc, const char **argv, char **envp)
 		{ "shortlog", cmd_shortlog, RUN_SETUP | USE_PAGER },
 		{ "show-branch", cmd_show_branch, RUN_SETUP },
 		{ "show", cmd_show, RUN_SETUP | USE_PAGER },
+		{ "statplot", cmd_statplot, RUN_SETUP },
 		{ "stripspace", cmd_stripspace },
 		{ "symbolic-ref", cmd_symbolic_ref, RUN_SETUP },
 		{ "tar-tree", cmd_tar_tree },
-- 
1.5.0.2.809.g0f936
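
For readers following index_one_pack() in the patch above: it never
inflates an object to measure it.  It records each object's offset in
the pack, sorts by offset, and takes the distance to the next object
as the space used.  Below is a minimal standalone sketch of that idea.
It is not the code from the patch: struct entry and
sizes_from_offsets() are hypothetical names, and the trailer length is
passed in by the caller (the patch subtracts a fixed 32 bytes).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's struct packed_object. */
struct entry {
	unsigned long offset;	/* in: byte offset of the object in the pack */
	unsigned long size;	/* out: bytes the object occupies in the pack */
};

static int by_offset(const void *a_, const void *b_)
{
	const struct entry *a = a_;
	const struct entry *b = b_;
	if (a->offset < b->offset)
		return -1;
	return a->offset > b->offset;
}

/*
 * Sort the entries by offset, then set each size to the distance to
 * the following entry.  The last entry is assumed to end where the
 * pack trailer begins; trailer_len is supplied by the caller.
 */
void sizes_from_offsets(struct entry *e, size_t n,
			unsigned long pack_size, unsigned long trailer_len)
{
	size_t i;

	assert(n > 0);
	qsort(e, n, sizeof(*e), by_offset);
	for (i = 0; i + 1 < n; i++)
		e[i].size = e[i + 1].offset - e[i].offset;
	e[n - 1].size = pack_size - trailer_len - e[n - 1].offset;
}
```

The sketch keeps offset and size in separate fields for readability;
the patch reuses a single packed_bytes field for both to save memory.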


* Re: [PATCH/RFC] statplot: a tool for mining repository statistics.
  2007-03-01  7:29 [PATCH/RFC] statplot: a tool for mining repository statistics Shawn O. Pearce
@ 2007-03-01  9:58 ` Alex Riesen
  2007-03-01 18:30   ` Shawn O. Pearce
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Riesen @ 2007-03-01  9:58 UTC
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git

On 3/1/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> This is a simple program to dump statistics about number of objects
> per day, and how much disk space is used per day.  It can be used
> to generate data files suitable for plotting repository growth
> over time.
>
> This program avoids displaying information about tags, but it does
> cover the commits referenced by those tags.  Since annotated tags
> are a very small portion of most projects I don't consider this to
> be a major issue at this time.
>

Tried it on my repo but the program produced some records with
all zeros. I.e.:
2005-11-10 6 222 235 463 1785888
2005-11-11 3 34 2 39 9729
2005-11-12 0 0 0 0 0
2005-11-13 0 0 0 0 0
2005-11-14 6 227 330 563 716230
2005-11-15 3 121 209 333 1162839

What could this possibly mean?

>  This is a toy, but it's more than good enough for production, so uh,
>  here it is.  Should it be included in git.git?  Don't know....
>

Definitely! It's interesting, and fun.


* Re: [PATCH/RFC] statplot: a tool for mining repository statistics.
  2007-03-01  9:58 ` Alex Riesen
@ 2007-03-01 18:30   ` Shawn O. Pearce
  2007-03-01 21:36     ` Alex Riesen
  2007-03-02 18:32     ` Jakub Narebski
  0 siblings, 2 replies; 6+ messages in thread
From: Shawn O. Pearce @ 2007-03-01 18:30 UTC
  To: Alex Riesen; +Cc: Junio C Hamano, git

Alex Riesen <raa.lkml@gmail.com> wrote:
> Tried it on my repo but the program produced some records with
> all zeros. I.e.:
> 2005-11-10 6 222 235 463 1785888
> 2005-11-11 3 34 2 39 9729
> 2005-11-12 0 0 0 0 0
> 2005-11-13 0 0 0 0 0
> 2005-11-14 6 227 330 563 716230
> 2005-11-15 3 121 209 333 1162839
> 
> What could this possibly mean?

There were no new objects attributed to 12 and 13 of Nov, 2005.
The program allocates a zero'd array for every day in the entire
range, then increments each day with the number of objects introduced
by commit(s) on that day.  It then just prints the entire array,
rather than only printing the non-zero lines.  ;-)
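
The bucketing described above can be sketched roughly as follows.
This is not the code from the patch: bucket_by_day() and
print_buckets() are hypothetical names, a "day" is assumed to be
seconds since the epoch divided by 86400, and the skip_zero switch is
an assumed extension that the posted patch does not have.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Count events per day over the range [first_day, last_day].
 * The array comes from calloc(), so days without events stay at 0.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
unsigned long *bucket_by_day(const unsigned int *days, size_t n,
			     unsigned int first_day, unsigned int last_day)
{
	size_t span, i;
	unsigned long *totals;

	assert(first_day <= last_day);
	span = (size_t)(last_day - first_day) + 1;
	totals = calloc(span, sizeof(*totals));
	if (!totals)
		return NULL;
	for (i = 0; i < n; i++) {
		assert(days[i] >= first_day && days[i] <= last_day);
		totals[days[i] - first_day]++;
	}
	return totals;
}

/*
 * Print "<day> <count>" for each slot.  With skip_zero set, idle days
 * are left out.  Returns the number of lines printed.
 */
size_t print_buckets(FILE *out, const unsigned long *totals,
		     unsigned int first_day, size_t span, int skip_zero)
{
	size_t i, printed = 0;

	for (i = 0; i < span; i++) {
		if (skip_zero && !totals[i])
			continue;
		fprintf(out, "%lu %lu\n",
			(unsigned long)(first_day + i), totals[i]);
		printed++;
	}
	return printed;
}
```

Keeping the zero rows gives one sample per day, which some plotting
setups may prefer; dropping them gives a shorter listing of active
days only.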

Horribly simple code.  Maybe too dumb for inclusion in git.git.

-- 
Shawn.


* Re: [PATCH/RFC] statplot: a tool for mining repository statistics.
  2007-03-01 18:30   ` Shawn O. Pearce
@ 2007-03-01 21:36     ` Alex Riesen
  2007-03-02 18:32     ` Jakub Narebski
  1 sibling, 0 replies; 6+ messages in thread
From: Alex Riesen @ 2007-03-01 21:36 UTC
  To: Shawn O. Pearce; +Cc: Junio C Hamano, git

On 3/1/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> > Tried it on my repo but the program produced some records with
> > all zeros. I.e.:
> > 2005-11-10 6 222 235 463 1785888
> > 2005-11-11 3 34 2 39 9729
> > 2005-11-12 0 0 0 0 0
> > 2005-11-13 0 0 0 0 0
> > 2005-11-14 6 227 330 563 716230
> > 2005-11-15 3 121 209 333 1162839
> >
> > What could this possibly mean?
>
> There were no new objects attributed to 12 and 13 of Nov, 2005.
> The program allocates a zero'd array for every day in the entire
> range, then increments each day with the number of objects introduced
> by commit(s) on that day.  It then just prints the entire array,
> rather than only printing the non-zero lines.  ;-)

oh, ok.

> Horribly simple code.  Maybe too dumb for inclusion in git.git.

I still think it's good for something (I used it to quickly find days
for weekly and monthly report).
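
A rough sketch of that kind of filtering, assuming the six-column
output format described in the patch's documentation.  The names
struct day_row, parse_statplot_line() and is_active_day() are made up
for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

struct day_row {
	char date[11];		/* "YYYY-MM-DD" plus the terminating NUL */
	unsigned long commits, trees, blobs, objects, bytes;
};

/*
 * Parse one line of statplot output.  Returns 1 for a data row and
 * 0 for a '#' comment line or anything that does not have the
 * expected six fields.
 */
int parse_statplot_line(const char *line, struct day_row *row)
{
	assert(line && row);
	if (line[0] == '#')
		return 0;
	return sscanf(line, "%10s %lu %lu %lu %lu %lu",
		      row->date, &row->commits, &row->trees,
		      &row->blobs, &row->objects, &row->bytes) == 6;
}

/* A day counts as active if at least one new commit landed on it. */
int is_active_day(const struct day_row *row)
{
	return row->commits > 0;
}
```

Feeding each line of repo.dat through parse_statplot_line() and
keeping only the rows where is_active_day() is true would list the
days worth mentioning in a report.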


* Re: [PATCH/RFC] statplot: a tool for mining repository statistics.
  2007-03-01 18:30   ` Shawn O. Pearce
  2007-03-01 21:36     ` Alex Riesen
@ 2007-03-02 18:32     ` Jakub Narebski
  2007-03-02 19:10       ` Johannes Schindelin
  1 sibling, 1 reply; 6+ messages in thread
From: Jakub Narebski @ 2007-03-02 18:32 UTC
  To: git

Shawn O. Pearce wrote:

> Horribly simple code.  Maybe too dumb for inclusion in git.git.

Perhaps it would be better to include it (first) in contrib/ section, then?
It looks more like contrib, by the way, than something that should be in
git.git core...
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


* Re: [PATCH/RFC] statplot: a tool for mining repository statistics.
  2007-03-02 18:32     ` Jakub Narebski
@ 2007-03-02 19:10       ` Johannes Schindelin
  0 siblings, 0 replies; 6+ messages in thread
From: Johannes Schindelin @ 2007-03-02 19:10 UTC
  To: Jakub Narebski; +Cc: git

Hi,

On Fri, 2 Mar 2007, Jakub Narebski wrote:

> Shawn O. Pearce wrote:
> 
> > Horribly simple code.  Maybe too dumb for inclusion in git.git.
> 
> Perhaps it would be better to include it (first) in contrib/ section, 
> then? It looks more like contrib, by the way, than something that should 
> be in git.git core...

It depends not only on libgit.a, but also on being a builtin. I actually 
_like_ writing builtins better than non-builtins, since the infrastructure 
is so simple.

If now only a unified option parser would appear out of the void, that'd 
be perfect.

Ciao,
Dscho

