Recharging Bandits

Published April 10, 2018, 18:36
We introduce a general model of bandit problems in which the expected payout of an arm is an increasing concave function of the time since it was last played. We first develop approximation algorithms for the underlying optimization problem of determining a reward-maximizing sequence of arm pulls. We then show how to use these algorithms in a learning setting to obtain sublinear regret.
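To make the model concrete, here is a minimal Python sketch, not the algorithms from the talk. It assumes discrete time steps with one pull per step, treats every arm as last played at time 0, and uses two made-up payoff functions (a square root and a saturating curve) that are both increasing and concave in the delay. The `greedy_schedule` helper is a hypothetical baseline that simply pulls the arm with the highest current expected payout; it is shown only to illustrate how rewards "recharge", and it carries no approximation or regret guarantee.

```python
import math

def total_reward(schedule, payoffs):
    """Sum expected payouts for pulling arms in the given order, one pull per step."""
    # Assumption: every arm counts as last played at time 0.
    last_played = {arm: 0 for arm in payoffs}
    total = 0.0
    for t, arm in enumerate(schedule, start=1):
        delay = t - last_played[arm]      # time since this arm was last played
        total += payoffs[arm](delay)      # payout grows (concavely) with the delay
        last_played[arm] = t
    return total

def greedy_schedule(payoffs, horizon):
    """Hypothetical baseline: at each step pull the arm with the largest current payout."""
    last_played = {arm: 0 for arm in payoffs}
    schedule = []
    for t in range(1, horizon + 1):
        arm = max(payoffs, key=lambda a: payoffs[a](t - last_played[a]))
        schedule.append(arm)
        last_played[arm] = t
    return schedule

# Illustrative payoff functions (assumptions, not from the talk):
# both are increasing and concave in the delay tau >= 1.
payoffs = {
    "a": lambda tau: math.sqrt(tau),
    "b": lambda tau: 3.0 * (1.0 - 0.5 ** tau),
}

if __name__ == "__main__":
    plan = greedy_schedule(payoffs, horizon=6)
    print(plan, total_reward(plan, payoffs))
```

Comparing `total_reward` across different schedules shows the trade-off the abstract describes: pulling an arm repeatedly collects only its low, freshly reset payout, while leaving it idle lets its payout grow.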

See more at microsoft.com/en-us/research/v...