Post by mikegarrison on Nov 25, 2014 18:06:30 GMT -5
[quoting gogophers]
I won't pretend to understand any of this, but maybe this is a good time to ask a question about HCA. Pablo assumes, if I understand it correctly, that HCA is the same for all teams playing at home. But I would think that certain teams, by virtue of location (e.g., mountain schools) or attendance, would have markedly better HCAs than other schools. I can understand the difficulty of crafting an HCA specific to a team. But--and here, finally, is the question--does using an HCA that is, out of necessity, the same for everyone create wrong predictions to a significant degree?

All models of real-world behavior have uncertainty associated with them. The question is not whether using a constant HCA for Pablo gets some answers wrong; the question is whether doing so gives a better overall answer than not doing it at all. I don't think there is enough data available to really model a team-specific HCA. Too many degrees of freedom to get a good answer. Maybe too many to get an answer at all.
Post by The Bofa on the Sofa on Nov 25, 2014 19:59:11 GMT -5
[quoting mikegarrison]
What's interesting is the HCA. Bofa, when you say that with HCA you get a better fit to the input than without it, does that mean you match 87% of the raw wins/losses with your HCA-adjusted methodology, or does it mean you match 87% of the HCA-adjusted wins/losses with your HCA-adjusted methodology?

If I optimize the ratings including an HCA, I can successfully reflect 87.7% of the results. If I don't include an HCA, I only get 84% or so, despite optimizing.

Now, a little of this is the result of using 5-point precision. When I optimize, I actually use much higher precision in the ratings (in fact, a team's rating is basically a random number to the right of the decimal point), but I round everything to the nearest 5 because I don't think any more precision is justified (although in this type of ranking I guess it would be). Therefore, matches where the teams might be separated under high precision turn into ties at low precision. If there is no HCA, they stay ties, and I count them wrong. However, with an HCA of 159, they aren't ties.

But even in the optimal case, if I include the HCA, I can get about 88.6% right. If I don't use an HCA, even with the high precision, the best I can do is still only 87.2%. It makes sense: if a team beats someone ranked higher at home and loses to someone ranked lower on the road, with no HCA those are two failures, but with an HCA you can maybe get both right.
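A minimal sketch of the fit-counting described above - assumed conventions and made-up ratings, not the actual Pablo/UPR code; only the HCA value of 159 is taken from the post:

```python
# Toy sketch of counting "reflected" results with an HCA and 5-point
# rounding (assumed conventions, not the actual rating code).

def round5(x):
    """Round a rating to the nearest 5 points."""
    return 5 * round(x / 5)

def fraction_reflected(matches, ratings, hca, round_to_5=True):
    """matches: (winner, loser, winner_was_home) tuples.
    A match counts as reflected if the winner's effective rating
    (rating plus HCA if at home) strictly exceeds the loser's."""
    correct = 0
    for winner, loser, winner_home in matches:
        rw, rl = ratings[winner], ratings[loser]
        if round_to_5:
            rw, rl = round5(rw), round5(rl)
        if winner_home:
            rw += hca       # home winner gets the bump
        else:
            rl += hca       # otherwise the (home) loser gets it
        correct += rw > rl  # a rating tie counts as a miss
    return correct / len(matches)

# Hypothetical pair of teams whose ratings tie once rounded to 5s:
ratings = {"A": 1502.3, "B": 1501.1}
home_win = [("A", "B", True)]          # A beat B at home
print(fraction_reflected(home_win, ratings, hca=0))    # 0.0 - tie, a miss
print(fraction_reflected(home_win, ratings, hca=159))  # 1.0 - HCA breaks the tie
```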
Post by gogophers on Nov 25, 2014 20:03:42 GMT -5
Mike, I think your response kind of avoids my question. I'm not looking to eliminate uncertainty. Pablo, by its nature, in drawing on the entire season's worth of matches, obviously cannot take into account whether the lineup has changed, whether players who were formerly healthy are now injured, or vice versa. So I'm well aware that there will always be uncertainty. I'm inquiring about one factor in particular.

All I was asking is whether Bofa believes that the necessity of using the same HCA across the board is a significant weakness - significant in that it explains a material number of the instances in which Pablo's prediction fails. It may well be significant even if an across-the-board HCA is better than no HCA at all.

Now, the answer may well be that Bofa doesn't know (or care) because he's never run the numbers. But he seems to have anticipated so many variables and spotted so many patterns, it wouldn't surprise me if he has done some limited modeling of a few teams to see whether using a higher or lower HCA would improve the prediction accuracy. As a Washington fan, you can appreciate the significant advantage a high-elevation team like Colorado has when a sea-level team like the Huskies comes to play - just to use one example.
Post by The Bofa on the Sofa on Nov 25, 2014 20:08:32 GMT -5
[quoting gogophers' post above]

There's not enough data in a single year to do meaningful individualized HCAs. I've run the experiment for a couple of conferences, and the variation is way out of control. And if you run it for multiple years, you don't see any evidence that the variation is anything but random. In the end, the randomness is so large that it would in fact introduce more errors than just applying a blanket factor. A blanket factor isn't always right, but since you don't actually know whose HCA is higher and whose is lower, that's the way to bet.
Post by mikegarrison on Nov 25, 2014 20:18:47 GMT -5
[quoting The Bofa on the Sofa's reply above]

Yes, but I don't think I explained my question correctly. Let's say the real results were:

A lost at B
B lost at C

Now, without HCA, you input those into the ranking as "B beat A" and "C beat B". If there were no other matches, the ranking would obviously be:

1 C
2 B
3 A

And you would have 100% success matching the real results. Now let's say you have a home-court advantage that explains away both losses, and because of that you end up with:

1 A
2 B
3 C

Now for checking the results: do you check them against a modified set of results, or against the raw results? If you modify the results, you get 100% success, because your modified wins/losses reflect the HCA. But if you compare your new ranking against the raw results, you actually get a 0% success rate.

Anyway, that's what I was curious about. When you measure your results and see "88.6% right", does that mean you matched 88.6% of the results as modified with an HCA, or 88.6% of the results purely W/L as they happened?

Another way to ask the question: in both your lists you have Washington above Colorado. When you are figuring out whether the list correctly matched the results, that is obviously a failure when HCA is not accounted for. But in your HCA-influenced ranking, do you count that as a success?
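Mike's three-team example can be made concrete. A toy scorer (assumed conventions and invented numbers, nothing to do with the real implementations) shows that the HCA-fitted ranking A > B > C scores 0% against the raw wins/losses but 100% once the checker itself applies the HCA:

```python
# Toy version of the A/B/C example: A lost at B, B lost at C.
# Each result is (home, away, home_won).
results = [("B", "A", True), ("C", "B", True)]
ratings = {"A": 300, "B": 200, "C": 100}  # hypothetical HCA-fitted ratings
HCA = 150                                 # made-up home bump

def success_rate(results, ratings, apply_hca):
    """Fraction of matches whose favored side actually won."""
    ok = 0
    for home, away, home_won in results:
        edge = ratings[home] - ratings[away] + (HCA if apply_hca else 0)
        ok += (edge > 0) == home_won
    return ok / len(results)

print(success_rate(results, ratings, apply_hca=False))  # 0.0 vs raw W/L
print(success_rate(results, ratings, apply_hca=True))   # 1.0 with the HCA
```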
Post by mikegarrison on Nov 25, 2014 20:25:22 GMT -5
[quoting The Bofa on the Sofa's reply above]

That's what I meant by too many degrees of freedom. When you can vary both the team ratings and the individual HCA, which one do you vary? Team A lost at team B by a score that would indicate about a 50-Pablo-point difference. But is that because team A was 5000 points better and team B's HCA is 5050 points? Or because team A was 20 points better and team B's HCA was 70 points? There is an infinite number of possible choices that show the same 50-point difference. But if HCA is a global constant, then the match resolves to a single defined difference between the two teams.
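The degrees-of-freedom problem here is just non-identifiability: from a single home result, only the sum (rating gap plus the home team's HCA) is pinned down, so infinitely many splits fit equally well. A short illustration using the numbers from the post:

```python
# One home win by a ~50-point margin pins down only the sum
# (B - A + B_hca); every split of that sum fits the match equally well.
observed = 50
for b_minus_a, b_hca in [(-5000, 5050), (-20, 70), (0, 50), (40, 10)]:
    print(b_minus_a, b_hca, b_minus_a + b_hca == observed)  # all True
```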
Post by The Bofa on the Sofa on Nov 25, 2014 21:16:00 GMT -5
[quoting mikegarrison's post above]

I still don't know what the hell you are talking about, but I think the answer is: if I optimize and include the HCA, I get 88.6%, but if I take those ratings and set the HCA to 0, it drops to 83% or so. If I reoptimize holding the HCA at 0, it gets back to 87%, but never to the level you can reach if you include an HCA.
Post by mikegarrison on Nov 25, 2014 21:25:07 GMT -5
[quoting The Bofa on the Sofa's reply above]

OK, this answers my poorly explained question: "if I take those ratings and set the HCA to 0, it drops to 83% or so".
Post by pogoball on Nov 26, 2014 0:03:32 GMT -5
I'm trying to get a layman's version of the differences between URS and Pablo (and perhaps RPI).

Would it be accurate to say that URS does a better job of reproducing who won the matches, but Pablo does a better job of reproducing the "score" of the match?

In other words, URS can model 9 of 10 match winners but will have no information, or poor information, on whether the matches were close. Pablo will model 8 of 10 match winners and additionally give a pretty good idea as to whether the matches were close or not.
Post by The Bofa on the Sofa on Nov 26, 2014 7:54:32 GMT -5
[quoting pogoball's post above]

I think the short summary is that Pablo is designed to give the best predictions of the outcomes of matches that have not been played, while the URS is designed to give the best reflection of the outcomes of matches that have already been played. The extent to which Pablo reflects prior wins and losses, and to which URS predicts the future, is not completely coincidental, but it isn't by design either.

I don't like the distinction that Pablo gives an idea of how close the matches were, because that's not the real fundamental difference. The main difference, I think, is that Pablo treats extreme results as outliers, whereas URS treats them as equal to any other result. Therefore, consider my example above of a team that beats #20 and loses to #120. Pablo will put them somewhere in between, and if you include points, closer to one side or the other. However, URS will put them either above #20 or below #120.
Post by bluepenquin on Nov 26, 2014 8:42:27 GMT -5
[quoting The Bofa on the Sofa's reply above]

This becomes confusing to me. It seems like URS will exclude outliers - or exclude one or more outliers in order to get the highest overall correct percentage. Pablo may see them as outliers, but would never just exclude an outlier as if the match didn't exist?

Back to the Oklahoma/Texas example: URS sees this as an outlier that cannot fit with all the other matches. For the greater good, URS ends up cutting its losses and doesn't count this match. Pablo may also view it as an outlier, but the match result (game score) still impacts its rating. I am probably completely misunderstanding.
Post by The Bofa on the Sofa on Nov 26, 2014 10:15:40 GMT -5
I think the short summary is that while Pablo is designed to give the best predictions of outcomes of matches that have not been played, the URS is designed to give the best reflection of outcomes of matches that have been played. The extent to which Pablo reflects prior wins and losses and to which URS predicts the future is not completely coincidental, but not by any intent. I don't like the distinction that Pablo gives an idea how close they were because that's not the real fundamental difference. The main difference I think is that Pablo takes extreme results and considers them outliers, whereas URS treats them as equal. Therefore, consider my example about of a team that beats #20 and loses to #120. Pablo will put them somewhere in-between, and if you include points, closer to one side or the other. However, URS will put them either above 20 or below 120. This becomes confusing to me. Seems like URS will exclude outliers - or exclude one or more outliers in order to get the highest overall correct %. Pablo may see them as outliers, but would never just exclude an outlier like the match didn't exist?
Back to the Oklahoma/Texas example: URS sees this as an outlier that cannot fit with all other matches. For the greater good - URS ends up cutting their losses and doesn't count this match. Pablo may also view this as an outlier, but match results (game score) impacts their rating. I am probably completely misunderstanding.
Nope, that's exactly it. In URS, outliers either have to be accommodated, treating them not as outliers, or those that can't be accommodated are completely ignored. In Pablo, they still have an impact. Pablo recognizes outliers - but it also tries to make them not as outliers as possible. URS doesn't care. You are work or you don't, and if you don't, you are ignored.
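A toy way to see the two philosophies (my own caricature, not either system's actual math): give each match a hard 0/1 vote on the ordering, URS-style, versus a smooth penalty that every match keeps pulling on, Pablo-style. For the beat-#20/lost-to-#120 team:

```python
# Caricature of the two philosophies for a team that beat the #20 team
# and lost to the #120 team (not either system's actual math).
wins_over, losses_to = [20], [120]

def hard_misses(my_rank):
    """URS-style: each match is a 0/1 ordering constraint."""
    bad = sum(my_rank > r for r in wins_over)   # should sit above teams it beat
    bad += sum(my_rank < r for r in losses_to)  # should sit below teams that won
    return bad

for rank in (10, 70, 130):
    print(rank, hard_misses(rank))  # 10 -> 1, 70 -> 2, 130 -> 1

# No rank satisfies both constraints, so a hard-count fit parks the team
# above #20 or below #120 and writes the other match off entirely.

def smooth_penalty(my_rank):
    """Pablo-style caricature: every match keeps a continuous pull."""
    return sum((my_rank - r) ** 2 for r in wins_over + losses_to)

print(min(range(1, 200), key=smooth_penalty))  # 70: lands in between
```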
Post by The Bofa on the Sofa on Nov 26, 2014 15:23:27 GMT -5
OK, here's some additional data. Above I posted the performance of a few different models in reflecting matches that have already been played (correct matches, fraction):

RPI Raw      3728   0.823
RPI HCA      3761   0.830
Full Pablo   3734   0.824
UPR          3971   0.877
UPR No HCA   3826   0.845
UPR Time     3957   0.874
Conveniently, since those rankings were based on data through Nov 16, we now have the ability to see how well they predict results, using the data through yesterday. Here are the prediction results:
RPI Raw      211   0.796
RPI HCA      215   0.811
Full Pablo   217   0.819
UPR          214   0.808
UPR No HCA   194   0.732
UPR Time     208   0.785
With only 265 matches, there aren't large differences in the number predicted correctly, but there are some, and the trends are consistent with what I've found before. But this gives us an idea.
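As a quick sanity check, each fraction above is just the correct-prediction count over the 265 matches:

```python
# Each prediction fraction is the correct count over 265 matches.
counts = {"RPI Raw": 211, "RPI HCA": 215, "Full Pablo": 217,
          "UPR": 214, "UPR No HCA": 194, "UPR Time": 208}
for name, n in counts.items():
    print(f"{name:10s} {n} {n / 265:.3f}")
# Reproduces 0.796, 0.811, 0.819, 0.808, 0.732, 0.785 exactly.
```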
As expected, Pablo is the best predictive method. RPI with an empirical HCA is just a bit behind, and significantly better than RPI without an HCA. UPR is a sniff behind RPI HCA, but it's close. UPR without an HCA is relatively dreadful, and weighting UPR toward more recent matches doesn't help predictability.
Before doing any more discussion, I will note that there is one missing piece here: the Pablo version of an Elo method. I am going to run that quickly to see how it compares.
Post by The Bofa on the Sofa on Nov 26, 2014 16:47:02 GMT -5
Let me update with the Pablo version of an Elo model (W/L only, no points):

RPI Raw      3728   0.823
RPI HCA      3761   0.830
Full Pablo   3734   0.824
Pablo WL     3754   0.829
UPR          3971   0.877
UPR No HCA   3826   0.845
UPR Time     3957   0.874

Predictions:

RPI Raw      211   0.796
RPI HCA      215   0.811
Full Pablo   217   0.819
Pablo WL     213   0.804
UPR          214   0.808
UPR No HCA   194   0.732
UPR Time     208   0.785
So it's a small sample set, but here are some thoughts.

If you want the best reflection of what happened: the Ultimate Pablo Ranking (UPR).
If you want the best predictor: Full Pablo. (You could never make that conclusion based just on this data, but we have the historical basis to conclude it.)
A lot of people will say: but hey, can we strike a balance between both? In that case, if you had to choose one, it's the UPR. It is far and away the best reflection of what did happen, and it doesn't compromise much in predictive ability compared to RPI WITH an HCA. And that even assumes you apply an HCA to RPI; if you don't, RPI can't do anything as well as UPR, or even as well as Pablo W/L. Pablo W/L isn't nearly the improvement in reflecting what happened that you might expect, but it is still better than RPI because it doesn't suffer nearly as much from regional bias. Maybe Massey would do a little better, I don't know.
So even if you don't like Pablo's emphasis on predictability, there's still no case for using RPI. You can do what RPI is claimed to do much better than RPI does it, without much compromise on the side benefits.

And if you want to make a "blended" model based on a mix of prediction and "postdiction", mix Pablo and the UPR, and don't put RPI into it at all.
Post by BeachbytheBay on Nov 26, 2014 18:37:15 GMT -5
[quoting The Bofa on the Sofa's post above]

Have you ever run Massey? I would think it would be similar to Pablo W/L, and Massey Power similar to full Pablo.