Microsoft Research
Published April 10, 2018, 18:36
We introduce a general model of bandit problems in which the expected payout of an arm is an increasing concave function of the time since it was last played. We first develop approximation algorithms for the underlying optimization problem of determining a reward-maximizing sequence of arm pulls. We then show how to use these algorithms in a learning setting to obtain sublinear regret.
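The sketch below illustrates the setting described in the abstract: each arm's expected payout is an increasing concave function of the time since that arm was last pulled. The specific recharge curve, the Bernoulli payouts, and the myopic greedy baseline are illustrative assumptions, not the approximation or learning algorithms presented in the talk.

```python
# Minimal sketch of a "recharging" bandit environment, assuming the concave
# recharge curve r_i(tau) = theta_i * (1 - exp(-tau / s_i)). All names and the
# greedy baseline are hypothetical, chosen only to illustrate the model.
import numpy as np

rng = np.random.default_rng(0)

class RechargingBandit:
    """K arms; the expected payout of arm i grows concavely in the elapsed
    time tau since arm i was last played."""

    def __init__(self, thetas, scales):
        self.thetas = np.asarray(thetas, dtype=float)   # asymptotic mean rewards
        self.scales = np.asarray(scales, dtype=float)   # recharge time constants
        self.last_pull = np.full(len(thetas), -np.inf)  # time each arm was last pulled
        self.t = 0

    def expected_reward(self, tau):
        # Increasing and concave in tau, bounded above by theta_i.
        return self.thetas * (1.0 - np.exp(-tau / self.scales))

    def pull(self, arm):
        tau = self.t - self.last_pull            # elapsed time per arm
        mean = self.expected_reward(tau)[arm]
        reward = rng.binomial(1, min(mean, 1.0))  # Bernoulli payout with that mean
        self.last_pull[arm] = self.t
        self.t += 1
        return reward

def greedy_policy(bandit):
    """Myopic baseline: pull the arm with the largest current expected payout.
    It uses the true curves, so it isolates planning rather than learning."""
    tau = bandit.t - bandit.last_pull
    return int(np.argmax(bandit.expected_reward(tau)))

bandit = RechargingBandit(thetas=[0.9, 0.6, 0.5], scales=[8.0, 2.0, 1.0])
total = sum(bandit.pull(greedy_policy(bandit)) for _ in range(1000))
print("greedy total reward over 1000 pulls:", total)
```

A learning algorithm with sublinear regret would additionally have to estimate the recharge curves from observed payouts; the greedy rule above is only a planning heuristic against known curves.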
See more at microsoft.com/en-us/research/v...