MS&E 130/231: Information Systems, Autumn 2005-06

Instructor: Ashish Goel

Handout 9: Homework 3. Given 11/14/05. Due 11/28/05.

 

Collaboration Policy: You are allowed to discuss general strategies for solving a problem with other students in the class, and also to clarify your understanding of the problem. You are not allowed to copy or see somebody else’s homework. For problems 2 and 6, both members of the group should mention each other’s name and attach a copy of the answer with each homework.

 

This is a longer homework, so it is worth 60 points as opposed to the usual 50. The extra credit problem is worth another 25. If your total score on all the HWs is more than 100%, we will reduce it to 100%.

 

  1. A wants to send a file to B. A and B do not trust each other, and would like the file to pass through an intermediary C who will maintain a record of the transaction. A and B do not want C to be able to recover the file (since the information in the file is private). However, in case of future litigation, either A or B must be able to give the original file to C and get C to verify that this was indeed the file that was transferred from A to B. Suggest a simple protocol for implementing this scheme. Ignore network intruders and the like – assume a fully secure channel between any two pairs.
  2. Research the technical design of Skype, with particular emphasis on the role of P2P technology, and the impact of Network Address Translation. Write a brief (one-two page) report. You may do this in groups of two.

3.     We saw an example in class where two web-pages could collude to increase their PageRank by a factor of 1/ε, where ε is the reset probability. Suppose we devise a technique whereby if any K web-pages are colluding, we can detect them. Outline a strategy by which K+1 web-pages can collude together to increase their PageRank. Your solution must have the following features (which you must prove):

a.      Any K web-pages alone appear to be non-colluding, and hence will not be caught by our new technique.

b.     The total PageRank of the K+1 web-pages gets boosted by approximately 1/ ε.

Assume that K << N, the total number of web pages.

  1. Explain how one could use reverse indices to answer a query of the form: Give me all documents containing the word “valley” or any of its synonyms, and the word “glacial”. If you knew in advance that you were going to get a lot of synonym queries, how should you build your reverse indices?
  2. Four merchants, A, B, C, and D are placing a bid for a keyword using Yahoo/Overture or Google. Imagine that the true value of a click is 80 cents for A, 70c for B, 50c for C, and 25c for D. Assume that the minimum allowable bid is 20c. Assume that the click-through rate is 5% for the top result (irrespective of which merchant is at the top), 4% for the second, 2% for the third, and 1% for the fourth. Recall that Overture ranks the merchants in decreasing order of their bids. In this specifc case, Google’s auction would also give the same ranking. For each click, a merchant is charged 1 cent more than the bid made by the next-ranked merchant; the fourth ranked merchant is charged 20c.

a.      If all the merchants bid their true values, what is the expected net profit for each of A, B, C, and D after 100 consumer impressions?

b.     If B, C, and D bid their correct values, what is the expected net profit of A if it bids 60 cents? If it bids 45 cents?  What if A bids 55 cents or 40 cents?

c.      If A bids 60 cents and C, D bid their true values, what is the optimum bid for B?

What can you conclude about the truthfulness of commonly used ad auctions?

  1. Estimate the cost of running a publicly available search engine. Ignore personnel and real-estate costs. Focus instead on the costs of disks, network access, and computers required for each of the three phases: crawling, pre-processing (making the forward and reverse indices and computing the importance), and answering queries. Assume realistic numbers for the size of the web and the number of queries per second. Assume that a new crawl is done every week. Compare this cost to the IT budget of a large company (cite the company and the source of your information). Again, you may do this in groups of two. Hint: Can you simplify the problem by considering the cost of one of the three phases to be roughly the same as that of another phase?
  2. [0 points, but compulsory] We have used telnet many times in class to simulate a tcp client and interact with a server. But we have never had the means to simulate a server and interact with a client. To facilitate this, I have started what is called an “echo” server on my linux computer (kauveri.stanford.edu) at port 8001. Go to your browser, and try to access any web page at the location http://kauveri.stanford.edu:8001. Wait for around 10 seconds. You will get a response back that will be exactly the request that went out from your browser. Print this out and submit it. Observe that this is not all that different or more complicated than the request syntax we used in class over telnet. The echo server is a slight modification of the code in the extra credit problem below.

 

[Extra credit]

 

      Developing a basic TCP server. No collaboration is allowed.

 

 

1)      [0 pts] A Bare-bones server – Manual response: Store the following program in a file (say “tcp_server.sh”) on elaine, and then run it with a port number as an argument (first make it executable using chmod +x, then type ./tcp_server.sh <port> on elaine). Use a number between 20000 and 40000 for the port number.

 

#!/bin/sh

# the next line restarts using -*-Tcl-*-sh \

exec tclsh "$0" ${1+"$@"}

 

# takes listener port number as argument

set port [ lindex $argv 0 ]

 

proc cfg {cid addr port} {

     fileevent $cid readable "handle $cid"

     fconfigure $cid -blocking  off

}

 

proc handle {cid} {

     # blank line closes disconnects

     while { [ gets $cid line ] > 0 } {

        puts $line

        flush stdout

        gets stdin line

        puts $cid $line

        flush $cid

     }

     close $cid

}

 

set cid [ socket -server cfg $port ]

vwait enter-mainloop

 

 

Notice that this is a very simple piece of code. Suppose you ran the server on elaine31. Use the MS-DOS command prompt or another UNIX session to telnet to elaine31 on the port number you used above. Type something in this telnet session, and you will see the same text appear on elaine31. Now type any response on elaine31, and you will see the same response appearing on the telnet session.

 

What you see above is a basic server written in an abstruse but popular programming language Tcl/Tk. Observe how simple it is to make a basic application server that runs over TCP. If one knows Tcl/Tk, then one can just modify the above a little to do many useful things. Since in this class we do not assume that the students know Tcl/Tk (or any specific language), we will provide a more modular interface for building a server using your language of choice.

 

2)      [0 pts] Extending the basic server: Compile the following two programs on elaine31:

 

couple.c:

 

#include <unistd.h>

#include <stdlib.h>

#include <stdio.h>

#include <assert.h>

#include <string.h>

#include <wait.h>

 

main(int argc, char **argv) {

  int pipe1[2], pipe2[2];

  int x, port;

  char command[90];

 

  assert(pipe(pipe1) == 0);

  assert(pipe(pipe2) == 0);

  assert(argc == 4);

 

  port = atoi(argv[2]);

  assert(port <= 40000);

  assert(port > 20000);

  assert(strlen(argv[1]) <= 80);

 

  sprintf(command, "%s %d",argv[1], port);

 

  x = fork();

  assert(x != -1);

  if (x == 0) {

    close(0);

    dup(pipe1[0]);

    close(1);

    dup(pipe2[1]);

    system(command);

  }

  else {

    close(0);

    dup(pipe2[0]);

    close(1);

    dup(pipe1[1]);

    execl(argv[3],argv[3], NULL);

    waitpid(x,0,0);

  }

}

 

and

 

testprimes.c:

 

#include <stdio.h>

 

int main() {

  unsigned long i,n;

 

  while (1){

    scanf("%d",&n);

 

    if (n < 2) {

      printf("Usage: testprimes n where n must be larger than 1\n");

      fflush(stdout);

    }

 

    for (i=2; i*i <= n; ++i) {

      if (n % i == 0) {

        printf("%d\n",i);

        fflush(stdout);

        break;

      }

    }

 

    if (i*i > n && n > 1)

      printf("%d is prime\n",n);

    fflush(stdout);

  }

}

 

We will assume that you compiled both programs so that the first executable is called couple, and the second is called testprimes. Try running testprimes. It just keeps accepting numbers as inputs and keeps outputting whether they are prime (or a factor).

 

The program couple is not something whose details are important. It is just a method for coupling the basic server functionality in part (a) with your application specific code. To see an example, run “couple tcp_server.sh <port> testprimes”. Now when you connect to this server using telnet, if you enter a number in the telnet session, you will get the output of testprimes in the telnet session.

 

Thus, we have a useful server – an application that you can query from anywhere to check primality.

 

3)      [25 pts] Mrs. Weasley’s Clock: Now substitute your own program for testprimes to make a location server. A location server must accept the following two kinds of inputs:

a.       LOCATION <name> <city>

 

Here, we will assume that name and city have a single word each. For example, LOCATION Goel Stanford

The response from the server should just be “Thanks”

 

b.      WHERE <name>

The response from the server must be “I have no idea” if no input of type (a) has been received with the same name, and the corresponding city if such an input has been received.

The location server needs to only remember those locations that it has been told about in the current execution of the server, not from past executions. Also, remember that CTRL-C will kill the program you are running.

 

The location server program need not be in C. If you can make an executable that runs on elaines, then you will be able to use the existing couple and tcp_server.sh programs.