In software development, a version control system (VCS) plays an important role, especially when a project is worked on by many programmers. Besides keeping track of changes, version control helps handle task distribution and the later integration of each programmer's work. Basically, there are two flavors of version control system: centralized and distributed. There are many comparisons between these two flavors on the net; one of them explained it well with some illustrations. The key difference between these two systems is that in a distributed VCS every programmer has a local working copy of the project, while in a centralized VCS every change must be pushed to the central repository. Continue reading
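To make that key difference concrete, here is a minimal sketch using Git, a distributed VCS: every clone holds a complete local repository, so you can record history without ever contacting a central server.

```shell
# Create a repository and commit entirely locally -- no central server
# is contacted at any point (the user name/email are just placeholders).
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "hello" > readme.txt
git add readme.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "first local commit"
git log --oneline   # the whole history lives in this working copy
```

In a centralized VCS such as Subversion, the equivalent `svn commit` would only succeed while the central repository is reachable.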
Universities and companies tend to have very strict Internet access policies. They usually deploy a proxy server to filter and deny access to websites they consider dangerous. Sometimes they just block several social networking sites or certain email providers to keep their network users from wasting time on what they call “unproductive things”. When I looked around for a way to get past this restriction, I stumbled upon this great tool called Tor. Continue reading
Like I told you in the last post, in order to create automatic part-of-speech tagging for text documents, I need to collect some corpora. In fact, because I want to do it on a distributed system, I need a large corpus. One great source of corpora is the web, but extracting plain text from HTML manually is quite cumbersome. I heard that a crawler can be used to extract text from the web, and that's how I stumbled upon Nutch.
A Little About Nutch
Nutch is an open source search engine built on Lucene and Solr. According to Tom White, Nutch basically consists of two parts: a crawler and a searcher. The crawler fetches pages from the web and creates an inverted index from them. The searcher accepts users' queries against the fetched pages. Nutch can run on a single computer, but it also works great on a multinode cluster: it uses Hadoop MapReduce in order to work well in a distributed environment.
Simple Crawling with Nutch
Let’s get to the point. My objective here is to build corpora from web pages, so I’m just going to crawl some pages and extract their text. I won’t write about searching for now, but I may cover it in another post. Okay, this is my environment for this experiment:
- Ubuntu 10.10 Maverick Meerkat
- Java 6 OpenJDK
- Nutch version 1.0, which you can download here.
After you’re ready, let’s get started, shall we? Continue reading
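As a preview, a minimal Nutch 1.0 crawl boils down to a seed list plus one command. This is a sketch that assumes you are inside the unpacked Nutch directory (the seed URL is just a placeholder):

```shell
# Create a seed list of start URLs for the crawler to fetch.
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Then launch the all-in-one crawl tool that ships with Nutch 1.0
# (commented out here, since it needs a configured Nutch install):
#   bin/nutch crawl urls -dir crawl -depth 3 -topN 50
# -depth = how many link hops to follow from the seeds;
# -topN  = maximum number of pages to fetch at each level.
cat urls/seed.txt
```

You would also normally edit `conf/crawl-urlfilter.txt` to restrict which domains the crawler is allowed to follow.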
About a year ago, I created a blog aggregator, sometimes also called a Planet. This planet displays all the blog posts from my registered friends. At first, I did it alone: I maintained and designed it by myself. Then my friend Andreas wanted to help me maintain the site, so I gave him an administrator role. Some days ago, he sent me this email:
Have you opened the website in IE6? The layout and the design look screwed :(
IE6, or Internet Explorer version 6, is the browser shipped with Windows XP. It was released in 2001 and, at the time, had better CSS support than its predecessor. The problem with the browser is its lack of support for web standards. If you’re a web designer, you must know the design problems in IE6; this browser has a bad reputation among web designers. Continue reading
Two months ago, I wrote a simple tutorial on how to create a Hadoop MapReduce program using Netbeans. I didn’t have the slightest clue that the post would change my life. Okay, I’m exaggerating; I mean the post changed the history of this blog.
The first day after I wrote the post, nothing special happened. But the day after, I was shocked when I checked this blog's stats. This blog usually gets about 15-25 visitors a day, so I was amazed when I saw 90 visitors that day. I started to investigate which post was contributing the most traffic, and found out that it was the Hadoop in Netbeans post. I noticed that all of the traffic came from one site called DZone. DZone? I had never heard of that site before. When I investigated, I found that it’s a cool bookmarking site for developers around the globe, and that someone, whom I later knew as mitchp, had shared my post there.
The magic continued the next day: my traffic kept increasing. Then, later that day, I got an email from my hosting provider:
The domain arifn.web.id has reached 80% of its bandwidth limit (807.50/1000.00 Megs).
Well, my bandwidth limit was just 1 GB a month. I double-checked my hosting package and found that my limit should actually be 2 GB, so I contacted my hosting provider to make sure they hadn’t made a mistake. They confirmed that my bandwidth should be 2 GB and said they would resolve it a.s.a.p. Meanwhile, I installed WP Super Cache to keep my site from going down under the high traffic and low bandwidth limit.
The magic ended. The peak was on the third day, with nearly 200 visitors. After that, the traffic declined and found a new equilibrium, but one higher than my old average: my traffic is now about 20-35 visitors a day. Not bad, huh?
Unfortunately, the mid- and long-term impact is a completely different story. By this I mean the number of people who discovered my portal thanks to the link and have become frequent visitors of the site since then. This is very difficult to assess (there is no way to know whether a new subscriber originally discovered your site thanks to the DZone link or it is just a temporal coincidence that he/she joined the site around those dates), but if we look at the increase in the number of subscribers to the RSS portal feeds, my twitter account, or the daily visits to the site, my estimation is that only 2-3% of the original DZone visitors have converted into new portal followers.
I second that. Maybe it’s just a sweet temporal coincidence that my traffic grew above the average. But one thing I learned from this experience is that if you want a high amount of traffic, you should write good posts regularly. And I hope I can do that.
Do you have the same experience?
When I was in college, I tried to implement Web Map Service (WMS) and Web Feature Service (WFS) as a foundation for a distributed Geographical Information System (better known as GIS). My academic advisor at that time told me that the idea was not entirely new, but that a lot of people still didn’t know about it. So, with this topic as my thesis, he hoped that one day people would know about this technology.
The implementation I made was quite simple, actually. But let me tell you the complete story. At first, I was thinking about developing a geographical operation that could be invoked via a web service in the cloud. After some weeks of analyzing and gathering information, I found that this would be really hard and time consuming: I had no background in geography (I’m a computer science student) and I didn’t have much time before the next graduation. In the end, I created a spatial data repository and made it accessible across the network using GeoServer, an open source implementation of WMS and WFS. I then used OpenLayers to create a simple web application that pulls the spatial data and displays it in the browser, and I also provided a simple data update feature, utilizing one of the capabilities of WFS. It’s really simple, actually.
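To give a feel for what "pulling spatial data" looks like, here is a sketch of a WMS GetMap request against a local GeoServer; the host, port, and the `topp:states` sample layer are assumptions for illustration. WMS returns a rendered map image, whereas WFS would return the raw vector features.

```shell
# Build a WMS 1.1.1 GetMap request URL piece by piece.
WMS="http://localhost:8080/geoserver/wms"
REQ="$WMS?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&LAYERS=topp:states"
REQ="$REQ&BBOX=-124.73,24.96,-66.97,49.37&WIDTH=600&HEIGHT=300"
REQ="$REQ&SRS=EPSG:4326&FORMAT=image/png"

# Fetch the rendered map tile (requires a running GeoServer):
#   curl -s -o map.png "$REQ"
echo "$REQ"
```

OpenLayers builds URLs like this one behind the scenes every time the user pans or zooms the map.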
In my graduate study, I now want to try something entirely different. I want to explore MapReduce, a programming model for processing large-scale data in a distributed environment. I heard about this model from some mailing lists and websites, and was surprised that the paper [pdf], the lecture notes, and the videos are easy to get. So, for the time being, I decided to do some experiments in order to learn something about it.
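The model itself can be shown in miniature with Unix pipes on a single machine: one step emits key records ("map"), sorting groups identical keys ("shuffle"), and counting collapses each group ("reduce"). Hadoop distributes exactly this pattern across a cluster.

```shell
# Word count, MapReduce-style:
#   tr      -> map:     emit one word record per line
#   sort    -> shuffle: group identical keys together
#   uniq -c -> reduce:  sum the count for each key
counts=$(echo "the quick fox and the lazy dog and the cat" \
  | tr ' ' '\n' | sort | uniq -c | sort -rn)
echo "$counts"   # the most frequent word ("the", 3 times) comes first
```

Word count is also the canonical first example in the original MapReduce paper.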
It’s still just a plan in my head, actually. I haven’t talked about it with my thesis advisor (because I don’t have one yet). But I can already predict some problems I will have to deal with if I pursue this research plan:
- The case. I don’t have any idea what problem I should solve with this research. My college advisor suggested doing something in bioinformatics, like genome assembly. I think I will consider it, but I’m open to ideas.
- The machines and their network. The lab is always busy with the other graduate students. Fortunately, a friend of mine told me there is another place on campus that I can use for experiments, but I have to get a permission letter first. Okay, I’ll do it.
In the meantime, I’ll focus on learning MapReduce. Maybe I’ll post something about it on this blog. If you have a suggestion about what I should do with this programming model, let me know; I’d be really glad to hear it.
Once, in my peaceful days, a friend rang me up. She said that a friend of hers had a problem with her email: the account had been hijacked by her ex-boyfriend, so she couldn’t log in anymore. And that was just the start. The bigger problem was that this ex also hijacked her Facebook account and started doing nasty things with it. Pretty scary, huh? Continue reading