| 作成者 |
|
|
|
|
|
|
|
| 本文言語 |
|
| 出版者 |
|
|
|
| 発行日 |
|
| 収録物名 |
|
| 巻 |
|
| 開始ページ |
|
| 終了ページ |
|
| 出版タイプ |
|
| アクセス権 |
|
| Crossref DOI |
|
| 関連DOI |
|
|
|
| 関連URI |
|
|
|
| 関連情報 |
|
|
|
| 概要 |
This study is concerned with finite Markov decision processes (MDPs) whose state are exactly observable but its transition matrix is unknown. We develop a learning algorithm of the reward-penalty type... for the communicating case of multi-chain MDPs. An adaptively optimal policy and an asymptotic sequence of adaptive policies with nearly optimal properties are constructed under the average expected reward criterion. Also, a numerical experiment is given to show the practical effectiveness of the algorithm.続きを見る
|