It is time once again to write up a post mortem for one of my Advanced FOSS projects. For this development cycle I worked on a project called gitsniffer. Gitsniffer was based on some scripts I developed a while back. The basic idea is that it is possible to extract a web site's source by finding an exposed .git directory hosted on its webserver. I blogged about this a while back. I'm happy to say that this project went way better than my previous one. I screwed up with sensenet, but this time I managed to get a good deal of work done. I also gave a lightning talk about packaging for Docker.
The Good
For this project I got a lot done. I built a web scraper in Python that uses BeautifulSoup, Celery and RethinkDB to scrape Hacker News for links, and then recursively process those sites looking for .git directories. The scraper works well and will index sites at a pretty good pace.
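If you're curious what the core of that check looks like, here's a minimal sketch, not the actual gitsniffer code: it pulls the Hacker News front page with requests, extracts outbound links with BeautifulSoup, and probes each site for an exposed .git/HEAD file. The real scraper fans this work out across Celery tasks and stores results in RethinkDB; the function names and the HEAD heuristic here are just for illustration.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

HN_FRONT_PAGE = "https://news.ycombinator.com/"


def outbound_links(page_url):
    """Fetch a page and return the absolute http(s) links found on it."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"])
        if urlparse(url).scheme in ("http", "https"):
            links.add(url)
    return links


def has_exposed_git(site_url):
    """Heuristic: an exposed repo usually serves /.git/HEAD starting with 'ref:'."""
    head_url = urljoin(site_url, "/.git/HEAD")
    try:
        resp = requests.get(head_url, timeout=5, allow_redirects=False)
    except requests.RequestException:
        return False
    return resp.status_code == 200 and resp.text.startswith("ref:")


if __name__ == "__main__":
    for link in outbound_links(HN_FRONT_PAGE):
        if has_exposed_git(link):
            print("possible exposed .git:", link)
```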
The Meh
I didn’t get around to writing a fancy web interface for the project. I’m a little sad about this, but I plan to keep working on gitsniffer, so I’ll write the website eventually.
The Bad
The scraper is pretty dumb, and very aggressive. If it finds a link on Hacker News it will not stop until it has checked every single link on that domain. This is not ideal, since there are many sites we can just ignore. I’m planning on adding some smarts to the system later on. Right now it just has a hard-coded list of domains it should avoid, along the lines of the sketch below.
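As a rough idea of what that filter looks like, here's a small sketch; the domains listed are made up for illustration, not the actual list gitsniffer ships with.

```python
from urllib.parse import urlparse

# Hypothetical skip list; the real one is hard coded in gitsniffer and
# will eventually be replaced with something smarter.
IGNORED_DOMAINS = {"github.com", "twitter.com", "youtube.com", "wikipedia.org"}


def should_crawl(url):
    """Skip links whose host matches (or is a subdomain of) an ignored domain."""
    host = urlparse(url).netloc.lower()
    return not any(host == d or host.endswith("." + d) for d in IGNORED_DOMAINS)


# Example: a gist link gets skipped, a random blog gets crawled.
print(should_crawl("https://gist.github.com/some/snippet"))  # False
print(should_crawl("https://example-blog.com/post"))          # True
```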
Wrapping up
Overall I’m pretty pleased with how gitsniffer turned out. The scraping backend is now a solid platform to build on. I plan on continuing to work on this and expanding it so that it will run on its own and notify people of potentially vulnerable .git directories.