Stat 252 and MS&E 238
Class time Spring 2008: tentatively Monday 3:15 - 6:05 pm
Class location Spring 2008: tentatively Gates B03
Data Mining and Electronic Business
This course is about People and Data: Collecting data about behavior on the web, in communication patterns, in social networks, on dating sites, etc. Mining the data, building predictive models, creating (and rejecting) hypotheses, designing cool experiments, and learning from them quickly. And figuring out what is similar to the past and what has changed, what is coming on the horizon, and what the underlying drivers are.
Until about a decade ago, the differentiator between a good and a bad firm often was its algorithms. The last decade has been about progress on data. We will discuss the impact of this communication and data revolution on individuals, business, and society, and on essentially most aspects of the world we live in. Applications range from online marketing (behavioral targeting and situational targeting) to architectures leveraging collective intelligence. We are also fortunate to have some great guest speakers come to class. The detailed write-up of each class is created by groups of students on the course wiki. The 2007 course wiki is on the web, and the 2005 and 2004 syllabi might also help with the decision of whether to take this course.
The first half of the quarter focuses on data: click data (what can be collected and what it is useful for), intention data (such as the queries from the searches you do; we will also discuss social search), attention data (such as tags on social bookmarking sites, with their important application to discovery), and interaction data (from email headers and social networking sites). We will also discuss prediction markets as yet another way of gleaning rich data from people. The second half of the quarter focuses on models and on creating appropriate structures and incentives. We will discuss models for products (recommender systems), people (reputation systems), situation, and location.
Students are expected to actively engage in class discussions, to have their assumptions challenged, and to bring their various backgrounds to class in order to make it a great experience for themselves and everybody else.
Schedule: We meet once a week, on Monday afternoon, for 3 hours (in 2008, this is Apr 7, 14, 21, 28, May 5, 12, 19, [no class on May 26, Memorial Day] and June 2, and possibly during exam week). This schedule proved useful last year since it makes it as easy as possible for local students to physically come to class and participate. This is a lot more fun than just watching it on the internet, and you learn a lot more. Note that this explicitly includes SCPD students who only signed up for remote access, just don't tell anyone :)
Course wiki: All students have full read/write access to the course wiki at aweigend.wiskispaces.com. I encourage you to contribute actively -- the class and you will benefit.
Grading: The main goal is that you gain insights into the area of People and Data, and that you transfer them to your own area, hopefully coming up with some interesting ideas and applications. To support this objective, your grade will be determined by the following:
Class wiki: We will form 8 groups, each with around 5 students. Each group is responsible for creating the initial wiki page for one of the classes by Friday 6pm (i.e., 4 days after class). These pages are hyperlinked, emphasizing the key learnings of each class. [30%]
Homework: There will be weekly assignments. They are due the day before class at 5pm, so that we can look through them and give brief feedback in a timely manner. [50%]
The first assignments focus on hands-on experience with data:
- understanding your own data (your web logs),
- getting data from other websites by modifying and running a simple spider (example code uses PHP), or using an API,
- running an online advertising campaign using Google AdWords, Yahoo SM, or Microsoft adCenter,
- measuring its effectiveness and, more broadly, understanding what can be tracked easily on your site, using Google Analytics.
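As a rough illustration of the first item (understanding your own web logs), here is a minimal Python sketch that counts successful page requests per path. It assumes the logs are in Common Log Format, which may not match your server's configuration; the sample lines, the function name `count_page_hits`, and the pattern name `LOG_PATTERN` are invented for this example and are not part of the course materials.

```python
import re
from collections import Counter

# Assumption: one request per line in Common Log Format, e.g.
# host ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_page_hits(lines):
    """Count requests per path, keeping only 2xx (successful) responses."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status").startswith("2"):
            hits[match.group("path")] += 1
    return hits


# Hypothetical sample data, made up for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [07/Apr/2008:15:20:01 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.2 - - [07/Apr/2008:15:21:13 -0700] "GET /about.html HTTP/1.1" 200 2048',
    '10.0.0.1 - - [07/Apr/2008:15:22:45 -0700] "GET /index.html HTTP/1.1" 200 5120',
    '10.0.0.3 - - [07/Apr/2008:15:23:02 -0700] "GET /missing.html HTTP/1.1" 404 312',
]

if __name__ == "__main__":
    # Print paths from most to least requested.
    for path, n in count_page_hits(SAMPLE_LOG).most_common():
        print(path, n)
```

To run it on a real log, pass an open file object instead of `SAMPLE_LOG`; lines that do not match the assumed format are silently skipped.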
One assignment leaves lots of space for creativity but also needs to be coded up (there were some amazing entries in 2007):
- write a recommendation system for del.icio.us
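To give a flavor of what such a recommender might look like, here is a minimal Python sketch of one possible approach: user-based collaborative filtering with Jaccard similarity over bookmark sets. This is only one of many techniques and not a prescribed solution; the data in `BOOKMARKS` and the function names are hypothetical, and the sketch does not call any del.icio.us API.

```python
from collections import defaultdict

# Hypothetical data: user name -> set of bookmarked URLs.
BOOKMARKS = {
    "alice": {"a.com", "b.com", "c.com"},
    "bob": {"a.com", "b.com", "d.com"},
    "carol": {"e.com"},
}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, bookmarks, top_n=3):
    """Suggest URLs bookmarked by similar users but not yet by `user`."""
    own = bookmarks[user]
    scores = defaultdict(float)
    for other, urls in bookmarks.items():
        if other == user:
            continue
        similarity = jaccard(own, urls)
        if similarity == 0:
            continue  # users with no overlap contribute nothing
        for url in urls - own:
            # Weight each candidate URL by how similar its owner is.
            scores[url] += similarity
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [url for url, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("alice", BOOKMARKS))
```

With the sample data, alice shares two of four distinct bookmarks with bob (similarity 0.5) and none with carol, so the only suggestion is `d.com`. A tag-based variant would replace the URL sets with tag sets.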
We will also
- mine the data of an online social network or a dating site,
- run Cleverset's recommender system on the "network" of sites of students in the class.
When appropriate, papers will be assigned to deepen your understanding.
Class participation. [20%]
Project: If you have a good and solid idea for an interesting project, I am happy to give feedback and jointly decide on whether it makes sense to do the project. I encourage projects in small groups. [optional]
There is no one textbook that will get you through this course, and I do not know of any decent textbook for the class. The material is very recent and originates from several academic disciplines (besides statistics and computer science, it draws on psychology, discusses modern marketing techniques and customer behavior, and uses social network analysis ideas originating in sociology). Depending on your specific background and interests, the following might be useful:
Toby Segaran: Programming Collective Intelligence (2007) Hands on, hacker mentality, includes Python code, useful for the del.icio.us recommendation engine homework
P. Baldi, P. Frasconi, and P. Smyth: Modeling the Internet and the Web (2003) Background on web technology, solid statistical modeling of behavior, information retrieval
C. Shapiro, and H.R. Varian: Information Rules (1998) Short book with insights about the networked economy (network effects, economics of digital goods, pricing, etc.)
M.J.A. Berry and G.S. Linoff: Data Mining Techniques (pdf) (2004) Applications of data mining in broad marketing and business in general (not just web)
T. Hastie, R. Tibshirani, and J.H. Friedman: The Elements of Statistical Learning (2003) The classic for more theoretical aspects in data mining
C.M. Bishop: Pattern Recognition and Machine Learning (2006) Recent book on machine learning from a Bayesian perspective
Papers are included as links in the course wiki.
Room 206 Sequoia Hall,
Office hours: Fri 2:30 - 4:00 (also via Yahoo messenger: stat252spring2008)
Room 238 Sequoia Hall
Office hours: Mon 1:15 - 2:45
Note: The previous version of this page (addressing students considering taking the course) is here.
by | +1 (917) 697-3800 | www.weigend.com