Creator |
|
|
|
|
Language |
|
Publisher |
|
|
Date issued |
|
Source title |
|
Volume |
|
Start page |
|
End page |
|
Publication type |
|
Access rights |
|
Crossref DOI |
|
Related DOI |
|
|
Related URI |
|
|
Related information |
|
|
Abstract |
This study is concerned with finite Markov decision processes (MDPs) whose states are exactly observable but whose transition matrix is unknown. We develop a learning algorithm of the reward-penalty type... for the communicating case of multi-chain MDPs. An adaptively optimal policy and an asymptotic sequence of adaptive policies with nearly optimal properties are constructed under the average expected reward criterion. A numerical experiment is also given to show the practical effectiveness of the algorithm.
|