Creator |
|
|
|
|
Language |
|
Publisher |
|
|
Date |
|
Source Title |
|
Vol |
|
First Page |
|
Last Page |
|
Publication Type |
|
Access Rights |
|
Crossref DOI |
|
Related DOI |
|
|
Related URI |
|
|
Relation |
|
|
Abstract |
This study is concerned with finite Markov decision processes (MDPs) whose state are exactly observable but its transition matrix is unknown. We develop a learning algorithm of the reward-penalty type... for the communicating case of multi-chain MDPs. An adaptively optimal policy and an asymptotic sequence of adaptive policies with nearly optimal properties are constructed under the average expected reward criterion. Also, a numerical experiment is given to show the practical effectiveness of the algorithm.show more
|