
I did not really pay attention to college basketball this year, so I decided to take a different approach to filling out my bracket.
I started by downloading the full Division 1 Men's Basketball schedule (scraped from rivals.yahoo.com), along with the score of each game, the date, and the home team.
In the model, I assume that each team has two (unknown) vectors of real numbers describing how good its offense and defense are in several attributes. For example, we might want to represent how good the guards on each team are, how good the forwards are, and how good the centers are - both at offense and defense.
Offense: [5, 10, 4]
Defense: [2, 3, 10]
This means that the guards are a 5 on offense and a 2 on defense, etc. In my model, it's easier if we assume that high numbers are better for offenses, and low numbers are better for defenses.
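The predicted score for one team is then the dot product of its offensive vector with the opposing team's defensive vector, and vice versa. In our running example, if our team from before played a team with vectors:
Offense: [3, 2, 4]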
Defense: [2, 5, 5]
Then the first team's score is predicted to be 5 * 2 + 10 * 5 + 4 * 5 = 80
and the second team's score is predicted to be 3 * 2 + 2 * 3 + 4 * 10 = 52
What a blowout!
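The arithmetic above is just a pair of dot products. Here is a minimal Python sketch of it; the function name `predict_score` and the dictionary layout are made up for illustration and are not taken from the original code.

```python
def predict_score(offense, defense):
    """Predicted points: dot product of one team's offense with the opponent's defense."""
    if len(offense) != len(defense):
        raise ValueError("offense and defense vectors must have the same length")
    return sum(o * d for o, d in zip(offense, defense))

# The two example teams from the text.
team_a = {"offense": [5, 10, 4], "defense": [2, 3, 10]}
team_b = {"offense": [3, 2, 4], "defense": [2, 5, 5]}

score_a = predict_score(team_a["offense"], team_b["defense"])
score_b = predict_score(team_b["offense"], team_a["defense"])
print(score_a, score_b)  # 80 52
```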
The catch is that we don't actually know the vectors describing each team's offense and defense. That's OK - we can learn them from the observed scores.
Formally, the goal is to find matrices O and D that minimize the sum of squared errors between predicted scores and observed scores. In math,
sum_g [ (score_gi - O_i: * D_j:)^2 + (score_gj - O_j: * D_i:)^2 ]
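where I am using the notation that team i played team j in game g (i and j are dependent on g, but I drop this dependence in the notation to keep things simpler)*. Here O_i: is team i's offensive vector and D_j: is team j's defensive vector, so O_i: * D_j: is the same dot product as in the example above.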
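To make the objective concrete, here is a small sketch that evaluates it for a given O and D. The `(i, j, score_i, score_j)` tuple layout and the single game result are made up for illustration; the vectors are the example ones from earlier.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_error(games, O, D):
    """Sum of squared errors between observed and predicted scores.

    games: list of (i, j, score_i, score_j) tuples (hypothetical layout).
    O, D:  dicts mapping team name -> offensive / defensive vector.
    """
    total = 0.0
    for i, j, score_i, score_j in games:
        total += (score_i - dot(O[i], D[j])) ** 2  # team i's score vs. prediction
        total += (score_j - dot(O[j], D[i])) ** 2  # team j's score vs. prediction
    return total

O = {"A": [5, 10, 4], "B": [3, 2, 4]}
D = {"A": [2, 3, 10], "B": [2, 5, 5]}
games = [("A", "B", 78, 55)]  # made-up final score; the model predicts 80 and 52

loss = squared_error(games, O, D)
print(loss)  # (78 - 80)^2 + (55 - 52)^2 = 13.0
```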
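I will not go into the details, but I take the derivative of the error function with respect to each latent vector in order to find the change to the vectors that makes the error smaller, and I repeat this until the error stops improving (that's batch gradient descent, for the detail-oriented folks out there).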
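For readers who want to see the shape of that procedure, here is a minimal batch gradient descent sketch in plain Python. It is an illustration under stated assumptions, not the original code: the function names, the learning rate, the step count, the random initialization range, and the three toy games are all made up, and it leaves out the regularization mentioned in the footnote.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def total_error(games, O, D):
    return sum((si - dot(O[i], D[j])) ** 2 + (sj - dot(O[j], D[i])) ** 2
               for i, j, si, sj in games)

def fit(games, teams, dim=1, lr=1e-4, steps=2000, seed=0):
    """Batch gradient descent on the sum of squared errors (assumed settings)."""
    rng = random.Random(seed)
    O = {t: [rng.uniform(5, 10) for _ in range(dim)] for t in teams}
    D = {t: [rng.uniform(5, 10) for _ in range(dim)] for t in teams}
    for _ in range(steps):
        # Accumulate the gradient over ALL games before updating (batch).
        gO = {t: [0.0] * dim for t in teams}
        gD = {t: [0.0] * dim for t in teams}
        for i, j, si, sj in games:
            for off, dfn, score in ((i, j, si), (j, i, sj)):
                resid = score - dot(O[off], D[dfn])
                for k in range(dim):
                    gO[off][k] += -2.0 * resid * D[dfn][k]
                    gD[dfn][k] += -2.0 * resid * O[off][k]
        for t in teams:
            for k in range(dim):
                O[t][k] -= lr * gO[t][k]
                D[t][k] -= lr * gD[t][k]
    return O, D

teams = ["A", "B", "C"]
games = [("A", "B", 80, 60), ("B", "C", 70, 65), ("A", "C", 85, 62)]  # made-up scores

O0, D0 = fit(games, teams, steps=0)     # same seed, so this is the starting point
O1, D1 = fit(games, teams, steps=2000)
before = total_error(games, O0, D0)
after = total_error(games, O1, D1)
print(before, after)
```

With a step size this small the error should shrink steadily; a larger `lr` converges faster but can diverge, so it would need tuning on real data.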
Remember, to predict the first team's score against another team, multiply the first team's offensive rating (higher is better) by the second team's defensive rating (lower is better).
Here are the top 10 offenses and defenses, as learned by the 1D version of my model:
Offenses
North Carolina (9.79462281797)
Pittsburgh (9.77375501699)
Connecticut (9.74628326851)
Memphis (9.71693872544)
Louisville (9.69785532917)
Duke (9.65866585522)
UCLA (9.59945808934)
West Virginia (9.56811566735)
Arizona St. (9.56282860536)
Missouri (9.55043151623)
Defenses
North Carolina (7.02359489844)
Pittsburgh (7.0416251036)
Memphis (7.05499448413)
Connecticut (7.07696194481)
Louisville (7.14778041166)
Duke (7.18950625894)
UCLA (7.21883856723)
Gonzaga (7.22607569868)
Kansas (7.2289767174)
Missouri (7.2395184452)
For each game, I report the predicted score, but for the bracket I just chose the predicted winner.
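Filling out the bracket then amounts to predicting each game and advancing the predicted winner. Here is a rough sketch of that loop, using the 1D ratings from the tables above for four teams. The function names and the tie-breaking rule are assumptions for illustration, and the scores it produces are not expected to reproduce the numbers listed below.

```python
def predict(a, b, ratings):
    """Return (score_a, score_b) for 1D ratings stored as (offense, defense)."""
    off_a, def_a = ratings[a]
    off_b, def_b = ratings[b]
    return off_a * def_b, off_b * def_a

def play_bracket(teams, ratings):
    """teams: list in bracket order, length a power of two.

    Returns (rounds, champion), where rounds holds the predicted scores
    of every game, round by round.
    """
    rounds = []
    while len(teams) > 1:
        results, winners = [], []
        for a, b in zip(teams[0::2], teams[1::2]):
            score_a, score_b = predict(a, b, ratings)
            results.append((a, score_a, b, score_b))
            # Assumption: ties go to the team listed first.
            winners.append(a if score_a >= score_b else b)
        rounds.append(results)
        teams = winners
    return rounds, teams[0]

# (offense, defense) from the 1D tables above.
ratings = {
    "Louisville": (9.69785532917, 7.14778041166),
    "Memphis": (9.71693872544, 7.05499448413),
    "Pittsburgh": (9.77375501699, 7.0416251036),
    "North Carolina": (9.79462281797, 7.02359489844),
}

rounds, champion = play_bracket(
    ["Louisville", "Memphis", "Pittsburgh", "North Carolina"], ratings)
print(champion)  # North Carolina
```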
==================== ROUND 1 ====================
Louisville 75.8969699266, Morehead St. 54.31731649
Ohio St. 74.9907105909, Siena 69.6702059811
Utah 69.7205426091, Arizona 69.592708246
Wake Forest 72.3264784371, Cleveland St. 64.3143396939
West Virginia 66.7025939102, Dayton 57.550404701
Kansas 84.0565034675, North Dakota St. 71.281863854
Boston Coll. 65.0669174572, USC 68.7027018576
Michigan St. 77.3858437718, Robert Morris 59.6407479
Connecticut 91.9763662649, Chattanooga 63.9941388666
BYU 74.7464520646, Texas A&M 70.5677646712
Purdue 69.8634461612, Northern Iowa 59.4892887466
Washington 81.8475059935, Mississippi St. 74.6374151171
Marquette 73.4307446299, Utah St. 69.1796188404
Missouri 83.8888903275, Cornell 68.1053984941
California 74.9638076999, Maryland 71.2565877894
Memphis 78.3145709447, CSU Northridge 59.0206289492
Pittsburgh 85.5983991252, E. Tennessee St. 64.8099546261
Oklahoma St. 81.6131739754, Tennessee 81.8021658489
Florida St. 59.994769086, Wisconsin 60.9139371828
Xavier 77.3537694, Portland St. 63.8161558802
UCLA 76.790261041, VCU 65.2726887151
Villanova 72.9957948506, American 58.6863439306
Texas 64.5805075558, Minnesota 62.3595994418
Duke 85.084666484, Binghamton 61.1984347353
North Carolina 99.2788271609, Radford 69.7291392149
LSU 65.0807263343, Butler 64.9895028812
Illinois 70.6250577544, West. Kentucky 57.6646396014
Gonzaga 75.0447785407, Akron 61.0678281691
Arizona St. 64.7151394863, Temple 58.0578420156
Syracuse 74.7825424779, Stephen F. Austin 60.5056731732
Clemson 74.4054903161, Michigan 70.8395522274
Oklahoma 78.5992492855, Morgan St. 59.7587888038
==================== ROUND 2 ====================
Louisville 67.3059313968, Ohio St. 60.5835683909
Utah 71.3007847464, Wake Forest 73.2895225467
West Virginia 67.9574088476, Kansas 67.4869037187
USC 62.1192840465, Michigan St. 64.56295945
Connecticut 76.8719158147, BYU 71.8412099454
Purdue 74.245343296, Washington 73.6100911982
Marquette 76.4607554812, Missouri 80.5497967091
California 64.7143532135, Memphis 70.9373235427
Pittsburgh 79.1278381289, Tennessee 70.6786108051
Wisconsin 63.0943233452, Xavier 63.5379857382
UCLA 74.1282015782, Villanova 71.4919550735
Texas 66.3817261194, Duke 70.9875941571
North Carolina 86.2296333847, LSU 73.8695973309
Illinois 62.6218220536, Gonzaga 65.6078661776
Arizona St. 74.0588194422, Syracuse 71.254787147
Clemson 76.9943827197, Oklahoma 78.9108038697
==================== SWEET 16 ====================
Louisville 72.8097088102, Wake Forest 68.2411945982
West Virginia 66.1905929215, Michigan St. 65.2198396254
Connecticut 70.4975234274, Purdue 67.014115714
Missouri 66.6046145365, Memphis 69.9964130636
Pittsburgh 72.8975484716, Xavier 64.848615134
UCLA 72.3676109557, Duke 73.1522519556
North Carolina 84.6606149747, Gonzaga 80.3910425893
Arizona St. 67.8668018941, Oklahoma 67.0441371239
==================== ELITE EIGHT ====================
Louisville 64.0822047092, West Virginia 61.7652102534
Connecticut 64.875382557, Memphis 65.9485921907
Pittsburgh 72.8027424093, Duke 70.5222034022
North Carolina 76.2640153058, Arizona St. 72.3363504426
==================== FINAL FOUR ====================
Louisville 60.7832463768, Memphis 61.4830569498
Pittsburgh 80.3421788636, North Carolina 81.0056716364
==================== FINAL GAME ====================
Memphis 73.8935857273, North Carolina 74.259537592
In the end, these predictions were enough to win my bracket. Obviously, it should all be taken with a grain of salt, but being a PhD student in Machine Learning [http://www.machinelearningphdstudent.com/], it was fun to put my money where my mouth was and have a little fun.
Oh, and let me know if you'd like the data I gathered or the code I wrote to make this work - I'm happy to share.
* I also regularize the latent vectors by adding a penalty on the squared L2 norm of the latent vectors. This is known to improve matrix-factorization-like models by encouraging them to be simpler, and less willing to pick up on spurious characteristics of the data.
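As a rough sketch of what such a penalty looks like in code: the penalty is added to the squared-error objective, and its gradient is added to each latent vector's gradient. The strength `lam` and both function names are hypothetical; no particular value is stated in this post.

```python
def l2_penalty(O, D, lam):
    """lam times the sum of squared entries of every latent vector."""
    return lam * sum(x * x for vecs in (O, D) for v in vecs.values() for x in v)

def penalty_gradient(vec, lam):
    """Gradient of lam * ||vec||^2 with respect to vec."""
    return [2.0 * lam * x for x in vec]

O = {"A": [5, 10, 4]}
D = {"A": [2, 3, 10]}
penalty = l2_penalty(O, D, lam=0.1)       # 0.1 * (141 + 113), up to float rounding
grad = penalty_gradient(O["A"], lam=0.1)  # pulls each entry toward zero
print(penalty, grad)
```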

