- [21] Petroni F., Rocktäschel T., Riedel S., Lewis P., Bakhtin A., Wu Y., Miller A., Language models as knowledge bases?, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2463-2473
- [22] Wei J., Bosma M., Zhao V.Y., Guu K., Yu A.W., Lester B., Du N., Dai A.M., Le Q.V., Finetuned language models are zero-shot learners, arXiv, (2021)
- [23] Apache License, Version 2.0
- [24] MIT License
- [25] Fourrier C., Habib N., Lozovskaya A., Szafer K., Wolf T., Open LLM Leaderboard v2
- [26] Zhou J., Lu T., Mishra S., Brahma S., Basu S., Luan Y., Zhou D., Hou L., Instruction-following evaluation for large language models, arXiv, (2023)
- [27] Suzgun M., Scales N., Scharli N., Gehrmann S., Tay Y., Chung H.W., Chowdhery A., Le Q.V., Chi E.H., Zhou D., Et al., Challenging big-bench tasks and whether chain-of-thought can solve them, arXiv, (2022)
- [28] Rein D., Hou B.L., Stickland A.C., Petty J., Pang R.Y., Dirani J., Michael J., Bowman S.R., GPQA: A graduate-level google-proof Q&A benchmark, arXiv, (2023)
- [29] Sprague Z., Ye X., Bostrom K., Chaudhuri S., Durrett G., MUSR: Testing the limits of chain-of-thought with multistep soft reasoning, arXiv, (2023)
- [30] Wang Y., Ma X., Zhang G., Ni Y., Chandra A., Guo S., Ren W., Arulraj A., He X., Jiang Z., et al., MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, arXiv, (2024)