The kernel implementation of epoll

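epoll consists of a set of three system calls:
    int epoll_create(int size);
    int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
    int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
The drawbacks of select/poll are:
    1. Every call has to copy its arguments in from user space again.
    2. Every call has to scan all the file descriptors again.
    3. At the start of every call the current process is added to the wait queue of every file descriptor, and at the end of the call it is removed from all of those wait queues again.
In practice select/poll may be watching a very large number of file descriptors. If each call returns only a small fraction of them, select/poll is inefficient. The design idea of epoll is to split the single select/poll operation into one epoll_create, many epoll_ctl calls and one epoll_wait.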
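As a rough illustration of that split, the following user-space sketch watches the read end of a pipe. The helper name demo_epoll_pipe and the choice of a pipe are assumptions made for the example and do not come from the kernel source; the sketch uses only the three system calls listed above plus pipe, write and close.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: registers the read end of a pipe with epoll,
 * queues one byte, and returns what epoll_wait reports
 * (the number of ready descriptors), or -1 on any failure.
 */
int demo_epoll_pipe(void)
{
    int pipefd[2];
    int epfd, nready = -1;
    struct epoll_event ev = {0}, out[4];

    if (pipe(pipefd) < 0)
        return -1;

    /* One epoll_create: the size hint only has to be positive. */
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    /* One epoll_ctl per descriptor; the kernel remembers the interest. */
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto close_all;

    /* Make the read end readable. */
    if (write(pipefd[1], "x", 1) != 1)
        goto close_all;

    /* One epoll_wait: no descriptor list is passed in again. */
    nready = epoll_wait(epfd, out, 4, 1000);
    if (nready == 1 && out[0].data.fd != pipefd[0])
        nready = -1;

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return nready;
}
```

Called from a main function on Linux, demo_epoll_pipe() should return 1, because exactly one registered descriptor is readable.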

The epoll mechanism implements its own special file system, the eventpoll filesystem.

    /* File callbacks that implement the eventpoll file behaviour */
    static const struct file_operations eventpoll_fops = {
        .release    = ep_eventpoll_release,
        .poll       = ep_eventpoll_poll
    };

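epoll_create creates a file that belongs to this file system and returns its file descriptor.

struct eventpoll holds the extended information for an epoll file node. It is stored in the private_data field of the file structure, and one instance is allocated for every epoll descriptor created by epoll_create. Its members are defined as follows; the comments are quite detailed.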

    /*
     * This structure is stored inside the "private_data" member of the file
     * structure and represents the main data structure for the eventpoll
     * interface.
     */
    struct eventpoll {
        /* Protects access to this structure; usable in interrupt context */
        spinlock_t lock;
        /*
         * This mutex is used to ensure that files are not removed
         * while epoll is using them. This is held during the event
         * collection loop, the file cleanup path, the epoll file exit
         * code and the ctl operations. Used in user process context.
         */
        struct mutex mtx;
        /* Wait queue used by sys_epoll_wait() */
        wait_queue_head_t wq;
        /* Wait queue used by file->poll() */
        wait_queue_head_t poll_wait;
        /* List of ready file descriptors */
        struct list_head rdllist;
        /* RB tree root used to store monitored fd structs */
        struct rb_root rbr;
        /*
         * This is a single linked list that chains all the "struct epitem" that
         * happened while transferring ready events to userspace w/out
         * holding ->lock.
         */
        struct epitem *ovflist;
        /* The user that created the eventpoll descriptor */
        struct user_struct *user;
    };

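The sockets added to an epoll descriptor through epoll_ctl, on the other hand, belong to the socket filesystem; keep this distinction in mind. Every item added for monitoring (monitoring here is not the same thing as the listen call) corresponds to one epitem structure. These structures are organized as a red-black tree whose root is stored in the eventpoll structure (the rbr member).

In addition, the epitem structures of sockets on which a monitored event has arrived are linked into a doubly linked list whose head is also kept in eventpoll (the rdllist member).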

    /*
     * Each file descriptor added to the eventpoll interface will
     * have an entry of this type linked to the "rbr" RB tree.
     */
    struct epitem {
        /* RB tree node used to link this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* List header used to link this structure to the eventpoll ready list */
        struct list_head rdllink;
        /*
         * Works together "struct eventpoll"->ovflist in keeping the
         * single linked chain of items.
         */
        struct epitem *next;
        /* The file descriptor information this item refers to */
        struct epoll_filefd ffd;
        /* Number of active wait queue attached to poll operations */
        int nwait;
        /* List containing poll wait queues */
        struct list_head pwqlist;
        /* The "container" of this item */
        struct eventpoll *ep;
        /* List header used to link this item to the "struct file" items list */
        struct list_head fllink;
        /* The structure that describe the interested events and the source fd */
        struct epoll_event event;
    };
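The rdllink member is what lets the kernel check in constant time whether an item is already on the ready list (the ep_is_linked test that appears in the functions further on). The following user-space sketch imitates that idea with a minimal circular doubly linked list. All names in it (struct node, struct item, ready_add and so on) are hypothetical stand-ins chosen for the example; it is not the kernel's list_head implementation.

```c
#include <stddef.h>

/* Minimal circular doubly linked list node, in the spirit of list_head. */
struct node {
    struct node *prev, *next;
};

/* A node that points at itself is "not linked" into any list. */
void node_init(struct node *n)
{
    n->prev = n->next = n;
}

int node_is_linked(const struct node *n)
{
    return n->next != n;
}

/* Hypothetical stand-in for struct epitem: one monitored descriptor. */
struct item {
    int fd;
    struct node rdllink;   /* link into the ready list */
};

/* Append the item to the ready list unless it is already linked.
 * Returns 1 if it was added, 0 if it was already linked. */
int ready_add(struct node *rdllist, struct item *it)
{
    if (node_is_linked(&it->rdllink))
        return 0;
    it->rdllink.prev = rdllist->prev;
    it->rdllink.next = rdllist;
    rdllist->prev->next = &it->rdllink;
    rdllist->prev = &it->rdllink;
    return 1;
}

/* Unlink the item and reset its node so it reads as "not linked". */
void ready_remove(struct item *it)
{
    it->rdllink.prev->next = it->rdllink.next;
    it->rdllink.next->prev = it->rdllink.prev;
    node_init(&it->rdllink);
}

size_t ready_count(const struct node *rdllist)
{
    size_t n = 0;
    const struct node *p;

    for (p = rdllist->next; p != rdllist; p = p->next)
        n++;
    return n;
}
```

Because the link is embedded in the item, a second event on the same descriptor does not create a second list entry: ready_add simply reports that the item is already queued.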

epoll_create itself is simple: it creates an eventpoll file and returns its file descriptor.

epoll_ctl is used to add, remove and modify monitored items.

    /*
     * The following function implements the controller interface for
     * the eventpoll file that enables the insertion/removal/change of
     * file descriptors inside the interest set.
     */
    SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
            struct epoll_event __user *, event)
    {
        int error;
        struct file *file, *tfile;
        struct eventpoll *ep;
        struct epitem *epi;
        struct epoll_event epds;
        DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_ctl(%d, %d, %d, %p)\n",
                 current, epfd, op, fd, event));
        error = -EFAULT;
        if (ep_op_has_event(op) &&
            copy_from_user(&epds, event, sizeof(struct epoll_event)))
            goto error_return;
        /* Get the "struct file *" for the eventpoll file */
        error = -EBADF;
        file = fget(epfd);
        if (!file)
            goto error_return;
        /* Get the "struct file *" for the target file */
        tfile = fget(fd);
        if (!tfile)
            goto error_fput;
        /* The target file descriptor must support poll */
        error = -EPERM;
        if (!tfile->f_op || !tfile->f_op->poll)
            goto error_tgt_fput;
        /*
         * We have to check that the file structure underneath the file descriptor
         * the user passed to us _is_ an eventpoll file. And also we do not permit
         * adding an epoll file descriptor inside itself.
         */
        error = -EINVAL;
        if (file == tfile || !is_file_epoll(file))
            goto error_tgt_fput;
        /*
         * At this point it is safe to assume that the "private_data" contains
         * our own data structure.
         */
        ep = file->private_data;
        mutex_lock(&ep->mtx);
        /*
         * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
         * above, we can be sure to be able to use the item looked up by
         * ep_find() till we release the mutex.
         */
        epi = ep_find(ep, tfile, fd);
        error = -EINVAL;
        switch (op) {
        case EPOLL_CTL_ADD:
            if (!epi) {
                epds.events |= POLLERR | POLLHUP;
                error = ep_insert(ep, &epds, tfile, fd);
            } else
                error = -EEXIST;
            break;
        case EPOLL_CTL_DEL:
            if (epi)
                error = ep_remove(ep, epi);
            else
                error = -ENOENT;
            break;
        case EPOLL_CTL_MOD:
            if (epi) {
                epds.events |= POLLERR | POLLHUP;
                error = ep_modify(ep, epi, &epds);
            } else
                error = -ENOENT;
            break;
        }
        mutex_unlock(&ep->mtx);
    error_tgt_fput:
        fput(tfile);
    error_fput:
        fput(file);
    error_return:
        DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_ctl(%d, %d, %d, %p) = %d\n",
                 current, epfd, op, fd, event, error));
        return error;
    }

Again the code is straightforward. Let us look at the add path first.

    /*
     * Must be called with "mtx" held.
     */
    static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                 struct file *tfile, int fd)
    {
        int error, revents, pwake = 0;
        unsigned long flags;
        struct epitem *epi;
        struct ep_pqueue epq;
        /* Do not allow more than the maximum number of watches */
        if (unlikely(atomic_read(&ep->user->epoll_watches) >=
                 max_user_watches))
            return -ENOSPC;
        if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
            return -ENOMEM;
        /* Item initialization follow here ... */
        INIT_LIST_HEAD(&epi->rdllink);
        INIT_LIST_HEAD(&epi->fllink);
        INIT_LIST_HEAD(&epi->pwqlist);
        epi->ep = ep;
        ep_set_ffd(&epi->ffd, tfile, fd);
        epi->event = *event;
        epi->nwait = 0;
        epi->next = EP_UNACTIVE_PTR;
        /* Initialize the poll table using the queue callback */
        epq.epi = epi;
        init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
        /*
         * Attach the item to the poll hooks and get current event bits.
         * We can safely use the file* here because its usage count has
         * been increased by the caller of this function. Note that after
         * this operation completes, the poll callback can start hitting
         * the new item.
         */
        revents = tfile->f_op->poll(tfile, &epq.pt);
        /*
         * We have to check if something went wrong during the poll wait queue
         * install process. Namely an allocation for a wait queue failed due
         * high memory pressure.
         */
        error = -ENOMEM;
        if (epi->nwait < 0)
            goto error_unregister;
        /* Add the current item to the list of active epoll hook for this file */
        spin_lock(&tfile->f_ep_lock);
        list_add_tail(&epi->fllink, &tfile->f_ep_links);
        spin_unlock(&tfile->f_ep_lock);
        /*
         * Add the current item to the RB tree. All RB tree operations are
         * protected by "mtx", and ep_insert() is called with "mtx" held.
         */
        ep_rbtree_insert(ep, epi);
        /* We have to drop the new item inside our item list to keep track of it */
        spin_lock_irqsave(&ep->lock, flags);
        /* If the file is already "ready" we drop it inside the ready list */
        if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);
            /* Notify waiting tasks that events are available */
            if (waitqueue_active(&ep->wq))
                wake_up_locked(&ep->wq);
            if (waitqueue_active(&ep->poll_wait))
                pwake++;
        }
        spin_unlock_irqrestore(&ep->lock, flags);
        atomic_inc(&ep->user->epoll_watches);
        /* We have to call this outside the lock */
        if (pwake)
            ep_poll_safewake(&psw, &ep->poll_wait);
        DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_insert(%p, %p, %d)\n",
                 current, ep, tfile, fd));
        return 0;
    error_unregister:
        ep_unregister_pollwait(ep, epi);
        /*
         * We need to do this because an event could have been arrived on some
         * allocated wait queue. Note that we don't care about the ep->ovflist
         * list, since that is used/cleaned only inside a section bound by "mtx".
         * And ep_insert() is called with "mtx" held.
         */
        spin_lock_irqsave(&ep->lock, flags);
        if (ep_is_linked(&epi->rdllink))
            list_del_init(&epi->rdllink);
        spin_unlock_irqrestore(&ep->lock, flags);
        kmem_cache_free(epi_cache, epi);
        return error;
    }

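init_poll_funcptr registers the poll table callback. The next step calls the poll function of tfile with the poll table as its second argument. This is the only point in the epoll mechanism at which poll is called on a monitored socket with a non-NULL second argument. ep_ptable_queue_proc registers the wait callback and adds the entry to the given wait queue, so once this first call has been made the registration is already in place and later poll calls do not need to repeat it.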

    /*
     * This is the callback that is used to add our wait queue to the
     * target file wakeup lists.
     */
    static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                     poll_table *pt)
    {
        struct epitem *epi = ep_item_from_epqueue(pt);
        struct eppoll_entry *pwq;
        if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
            /* Register a wait callback for the monitored socket, called on wakeup */
            init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
            pwq->whead = whead;
            pwq->base = epi;
            add_wait_queue(whead, &pwq->wait);
            list_add_tail(&pwq->llink, &epi->pwqlist);
            epi->nwait++;
        } else {
            /* We have to signal that an error occurred */
            epi->nwait = -1;
        }
    }
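The key point of this registration is that the wait queue entry carries a function pointer instead of a sleeping task, so a wakeup on the socket turns into a function call. The following user-space sketch models only that idea. The types and functions in it (wait_entry, wait_head, wq_add, wq_wake_all, mark_ready) are hypothetical names invented for the illustration and are much simpler than the kernel's wait queue API.

```c
#include <stddef.h>

struct wait_entry;

/* Callback type: invoked for each entry when the queue is woken. */
typedef int (*wake_fn)(struct wait_entry *self);

struct wait_entry {
    wake_fn func;              /* plays the role of ep_poll_callback */
    void *priv;                /* plays the role of the owning epitem */
    struct wait_entry *next;
};

/* Plays the role of a per-socket wait queue head. */
struct wait_head {
    struct wait_entry *first;
};

/* Registration: done once, when the descriptor is added. */
void wq_add(struct wait_head *h, struct wait_entry *e)
{
    e->next = h->first;
    h->first = e;
}

/* Wakeup: call every registered callback, return how many ran. */
int wq_wake_all(struct wait_head *h)
{
    int n = 0;
    struct wait_entry *e;

    for (e = h->first; e != NULL; e = e->next)
        n += e->func(e);
    return n;
}

/* Example callback: marks the owner as ready instead of waking a task. */
int mark_ready(struct wait_entry *self)
{
    int *ready_flag = self->priv;

    *ready_flag = 1;
    return 1;
}
```

In this model nothing scans the descriptor. The flag changes only when the queue is woken, which is the property epoll relies on.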

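So what does that poll function actually look like? That depends on the type of socket that was created (by the socket call) before it was passed to epoll_ctl. For a TCP socket, following the creation path shows that the corresponding function is tcp_poll.

The main jobs of tcp_poll are:

  1. If a poll table callback is present (ep_ptable_queue_proc), call it to register on the wait queue. This happens only on the first call; later poll calls skip this step.
  2. Determine which events have arrived (from the relevant TCP state).

tcp_poll registers on the wait queue sk_sleep, a member of sock, and that queue is woken by the corresponding I/O events. When the wait queue is woken, the registered wait callback is invoked; as shown above, the one we registered is ep_poll_callback. This function may be called in interrupt context.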

    /*
     * This is the callback that is passed to the wait queue wakeup
     * mechanism. It is called by the stored file descriptors when they
     * have events to report.
     */
    static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
    {
        int pwake = 0;
        unsigned long flags;
        struct epitem *epi = ep_item_from_wait(wait);
        struct eventpoll *ep = epi->ep;
        DNPRINTK(3, (KERN_INFO "[%p] eventpoll: poll_callback(%p) epi=%p ep=%p\n",
                 current, epi->ffd.file, epi, ep));
        /* Take the eventpoll spinlock, since we may be in interrupt context */
        spin_lock_irqsave(&ep->lock, flags);
        /* No event of interest.
         * If the event mask does not contain any poll(2) event, we consider the
         * descriptor to be disabled. This condition is likely the effect of the
         * EPOLLONESHOT bit that disables the descriptor when an event is received,
         * until the next EPOLL_CTL_MOD will be issued.
         */
        if (!(epi->event.events & ~EP_PRIVATE_BITS))
            goto out_unlock;
        /*
         * If we are transferring events to userspace, we can hold no locks
         * (because we're accessing user memory, and because of linux f_op->poll()
         * semantics). All the events that happens during that period of time are
         * chained in ep->ovflist and requeued later on.
         */
        if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
            if (epi->next == EP_UNACTIVE_PTR) {
                epi->next = ep->ovflist;
                ep->ovflist = epi;
            }
            goto out_unlock;
        }
        /* If this file is already in the ready list we exit soon */
        if (ep_is_linked(&epi->rdllink))
            goto is_linked;
        /* Add to the ready queue */
        list_add_tail(&epi->rdllink, &ep->rdllist);
    is_linked:
        /*
         * Wake up ( if active ) both the eventpoll wait list and the ->poll()
         * wait list.
         */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    out_unlock:
        spin_unlock_irqrestore(&ep->lock, flags);
        /* We have to call this outside the lock */
        if (pwake)
            ep_poll_safewake(&psw, &ep->poll_wait);
        return 1;
    }

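Note that there are two kinds of queue here. One is the eventpoll wait queue used by the epoll_wait call to learn whether any monitored socket is ready. The other is the per-socket wait queue sk_sleep, used to detect events on each monitored socket. When the latter is woken it calls ep_poll_callback, which in turn calls a wakeup function on the former queue to notify the process blocked in epoll_wait.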

    static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
               int maxevents, long timeout)
    {
        int res, eavail;
        unsigned long flags;
        long jtimeout;
        wait_queue_t wait;
        /*
         * Calculate the timeout by checking for the "infinite" value ( -1 )
         * and the overflow condition. The passed timeout is in milliseconds,
         * that why (t * HZ) / 1000.
         */
        jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
            MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
    retry:
        spin_lock_irqsave(&ep->lock, flags);
        res = 0;
        if (list_empty(&ep->rdllist)) {
            /*
             * We don't have any available event to return to the caller.
             * We need to sleep here, and we will be wake up by
             * ep_poll_callback() when events will become available.
             */
            init_waitqueue_entry(&wait, current);
            wait.flags |= WQ_FLAG_EXCLUSIVE;
            __add_wait_queue(&ep->wq, &wait);
            for (;;) {
                /*
                 * We don't want to sleep if the ep_poll_callback() sends us
                 * a wakeup in between. That's why we set the task state
                 * to TASK_INTERRUPTIBLE before doing the checks.
                 */
                set_current_state(TASK_INTERRUPTIBLE);
                if (!list_empty(&ep->rdllist) || !jtimeout)
                    break;
                if (signal_pending(current)) {
                    res = -EINTR;
                    break;
                }
                spin_unlock_irqrestore(&ep->lock, flags);
                jtimeout = schedule_timeout(jtimeout);
                spin_lock_irqsave(&ep->lock, flags);
            }
            __remove_wait_queue(&ep->wq, &wait);
            set_current_state(TASK_RUNNING);
        }
        /* Is it worth to try to dig for events ? */
        eavail = !list_empty(&ep->rdllist);
        spin_unlock_irqrestore(&ep->lock, flags);
        /*
         * Try to transfer events to user space. In case we get 0 events and
         * there's still timeout left over, we go trying again in search of
         * more luck.
         */
        if (!res && eavail &&
            !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
            goto retry;
        return res;
    }

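This function is the wait routine called from epoll_wait. It waits to be woken by ep_poll_callback, then calls ep_send_events to copy the arrived events to user space, and only then does epoll_wait return.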
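The timeout conversion at the top of ep_poll rounds a millisecond value up to whole scheduler ticks, so a short non-zero timeout never becomes zero ticks. A small sketch of the same arithmetic follows. The function name ms_to_ticks, its parameters and the SKETCH_INFINITE constant are assumptions made for the example; in the kernel the tick rate is HZ, the cap is EP_MAX_MSTIMEO and the "wait forever" value is MAX_SCHEDULE_TIMEOUT.

```c
#include <limits.h>

/* Stand-in for MAX_SCHEDULE_TIMEOUT in this sketch. */
#define SKETCH_INFINITE LONG_MAX

/*
 * Convert a timeout in milliseconds to scheduler ticks, rounding up.
 * hz is the number of ticks per second; max_ms is a cap above which
 * the timeout is treated as infinite. The caller is assumed to pick
 * max_ms small enough that timeout_ms * hz cannot overflow a long.
 */
long ms_to_ticks(long timeout_ms, long hz, long max_ms)
{
    if (timeout_ms < 0 || timeout_ms >= max_ms)
        return SKETCH_INFINITE;
    return (timeout_ms * hz + 999) / 1000;
}
```

With hz = 100 (one tick every 10 ms), a 1 ms timeout becomes 1 tick and an 11 ms timeout becomes 2 ticks, while a timeout of 0 stays 0, which is what makes the !jtimeout check in the loop return immediately.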

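Finally, let us look at how ep_poll_callback and ep_send_events are synchronized, since both operate on the ready queue.

eventpoll uses two kinds of lock. The first, mtx, is a mutex; it is the basic lock that serializes operations on the descriptor, and taking it may sleep. For that reason there is a second lock, lock, which is a spinlock and does not sleep. It is the one used in ep_poll_callback, where mtx cannot be used.

Note that because ep_poll_callback accesses the ovflist and rdllist members of eventpoll, every other place that accesses them must first take mtx and then take lock.

Because interrupts arrive asynchronously, it is easier to look at ep_send_events first.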

    static int ep_send_events(struct eventpoll *ep, struct epoll_event __user *events,
                  int maxevents)
    {
        int eventcnt, error = -EFAULT, pwake = 0;
        unsigned int revents;
        unsigned long flags;
        struct epitem *epi, *nepi;
        struct list_head txlist;
        INIT_LIST_HEAD(&txlist);
        /*
         * We need to lock this because we could be hit by
         * eventpoll_release_file() and epoll_ctl(EPOLL_CTL_DEL).
         */
        mutex_lock(&ep->mtx);
        /*
         * Steal the ready list, and re-init the original one to the
         * empty list. Also, set ep->ovflist to NULL so that events
         * happening while looping w/out locks, are not lost. We cannot
         * have the poll callback to queue directly on ep->rdllist,
         * because we are doing it in the loop below, in a lockless way.
         */
        spin_lock_irqsave(&ep->lock, flags);
        list_splice(&ep->rdllist, &txlist);
        INIT_LIST_HEAD(&ep->rdllist);
        ep->ovflist = NULL;
        spin_unlock_irqrestore(&ep->lock, flags);
        /*
         * We can loop without lock because this is a task private list.
         * We just splice'd out the ep->rdllist in ep_collect_ready_items().
         * Items cannot vanish during the loop because we are holding "mtx".
         */
        for (eventcnt = 0; !list_empty(&txlist) && eventcnt < maxevents;) {
            epi = list_first_entry(&txlist, struct epitem, rdllink);
            list_del_init(&epi->rdllink);
            /*
             * Get the ready file event set. We can safely use the file
             * because we are holding the "mtx" and this will guarantee
             * that both the file and the item will not vanish.
             */
            revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
            revents &= epi->event.events;
            /*
             * Is the event mask intersect the caller-requested one,
             * deliver the event to userspace. Again, we are holding
             * "mtx", so no operations coming from userspace can change
             * the item.
             */
            if (revents) {
                if (__put_user(revents,
                           &events[eventcnt].events) ||
                    __put_user(epi->event.data,
                           &events[eventcnt].data))
                    goto errxit;
                if (epi->event.events & EPOLLONESHOT)
                    epi->event.events &= EP_PRIVATE_BITS;
                eventcnt++;
            }
            /*
             * At this point, noone can insert into ep->rdllist besides
             * us. The epoll_ctl() callers are locked out by us holding
             * "mtx" and the poll callback will queue them in ep->ovflist.
             */
            if (!(epi->event.events & EPOLLET) &&
                (revents & epi->event.events))
                list_add_tail(&epi->rdllink, &ep->rdllist);
        }
        error = 0;
    errxit:
        spin_lock_irqsave(&ep->lock, flags);
        /*
         * During the time we spent in the loop above, some other events
         * might have been queued by the poll callback. We re-insert them
         * inside the main ready-list here.
         */
        for (nepi = ep->ovflist; (epi = nepi) != NULL;
             nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
            /*
             * If the above loop quit with errors, the epoll item might still
             * be linked to "txlist", and the list_splice() done below will
             * take care of those cases.
             */
            if (!ep_is_linked(&epi->rdllink))
                list_add_tail(&epi->rdllink, &ep->rdllist);
        }
        /*
         * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
         * releasing the lock, events will be queued in the normal way inside
         * ep->rdllist.
         */
        ep->ovflist = EP_UNACTIVE_PTR;
        /*
         * In case of error in the event-send loop, or in case the number of
         * ready events exceeds the userspace limit, we need to splice the
         * "txlist" back inside ep->rdllist.
         */
        list_splice(&txlist, &ep->rdllist);
        if (!list_empty(&ep->rdllist)) {
            /*
             * Wake up (if active) both the eventpoll wait list and the ->poll()
             * wait list (delayed after we release the lock).
             */
            if (waitqueue_active(&ep->wq))
                wake_up_locked(&ep->wq);
            if (waitqueue_active(&ep->poll_wait))
                pwake++;
        }
        spin_unlock_irqrestore(&ep->lock, flags);
        mutex_unlock(&ep->mtx);
        /* We have to call this outside the lock */
        if (pwake)
            ep_poll_safewake(&psw, &ep->poll_wait);
        return eventcnt == 0 ? error : eventcnt;
    }
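One visible consequence of the loop in ep_send_events is the difference between level-triggered and edge-triggered mode: an item whose event is still pending is put back on the ready list unless EPOLLET is set. The following user-space sketch observes this from outside the kernel. The helper name second_wait_result and the use of a pipe are assumptions made for the example. It queues one byte, never reads it, and reports what a second epoll_wait returns.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: registers the read end of a pipe with
 * EPOLLIN | extra_flags (0 for level-triggered, EPOLLET for
 * edge-triggered), writes one byte that is never read, and calls
 * epoll_wait twice. Returns the result of the second call,
 * or -1 on failure.
 */
int second_wait_result(unsigned int extra_flags)
{
    int pipefd[2];
    int epfd, first, second = -1;
    struct epoll_event ev = {0}, out;

    if (pipe(pipefd) < 0)
        return -1;

    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = pipefd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
        goto close_all;

    if (write(pipefd[1], "x", 1) != 1)
        goto close_all;

    /* First wait: the event is reported in both modes. */
    first = epoll_wait(epfd, &out, 1, 1000);
    if (first == 1) {
        /* Second wait, no new data, zero timeout so it cannot block. */
        second = epoll_wait(epfd, &out, 1, 0);
    }

close_all:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return second;
}
```

On Linux, second_wait_result(0) should return 1, because the level-triggered item is re-queued while data is still unread, and second_wait_result(EPOLLET) should return 0, because no new event arrived after the first report.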

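The comments in this function are also clear, but let us analyze it as a whole.

First the function takes the mtx lock, which is required.

The next job is to read the ready queue, but interrupt handlers write to it, so the spinlock must be taken. The work that follows may sleep, however, and holding the spinlock for the whole loop would block ep_poll_callback and therefore block interrupts, which is unacceptable. So epoll adds another member to eventpoll, ovflist. Before walking the ready queue we set this member to NULL, and then the spinlock can be released. This works because ep_poll_callback, after taking the spinlock, does not always put an arriving event on the ready queue; it first checks whether ovflist is EP_UNACTIVE_PTR.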

    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        /* Reaching here means a user process is currently inside ep_send_events,
         * so the event is chained on ovflist rather than on the ready queue */
        if (epi->next == EP_UNACTIVE_PTR) { /* if this is false the epi is already on ovflist, so just return */
            epi->next = ep->ovflist;
            ep->ovflist = epi;
        }
        goto out_unlock;
    }

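So events that arrive during this period are placed on ovflist. When the loop finishes, the function walks that list and adds its items to the ready queue, and finally sets ovflist back to EP_UNACTIVE_PTR so that events from the next interrupt go to the ready queue again. Last, it checks whether other epoll_wait callers are blocked and wakes them if so.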
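The handoff can be modelled in a few lines of single-threaded user-space C. This is only a sketch of the state machine: locking is omitted, the ready list is a small array, and every name (struct model, on_event, transfer_begin, transfer_end, UNACTIVE) is a hypothetical stand-in for the kernel identifiers discussed above. It also shows why a distinct sentinel is used instead of NULL: NULL marks the end of the overflow chain, so it cannot also mean "not on the chain".

```c
#include <stddef.h>

#define MAX_READY 8
/* Sentinel meaning "not in use", in the spirit of EP_UNACTIVE_PTR. */
#define UNACTIVE ((struct item *)-1L)

struct item {
    int id;
    int on_ready;          /* stands in for ep_is_linked(&epi->rdllink) */
    struct item *next;     /* overflow chain link, UNACTIVE when unchained */
};

struct model {
    struct item *ready[MAX_READY];  /* simplified ready list */
    size_t nready;
    struct item *ovflist;           /* UNACTIVE unless a transfer is running */
};

void model_init(struct model *m)
{
    m->nready = 0;
    m->ovflist = UNACTIVE;
}

void item_init(struct item *it, int id)
{
    it->id = id;
    it->on_ready = 0;
    it->next = UNACTIVE;
}

static void ready_push(struct model *m, struct item *it)
{
    if (!it->on_ready && m->nready < MAX_READY) {
        m->ready[m->nready++] = it;
        it->on_ready = 1;
    }
}

/* Models the branch in ep_poll_callback. */
void on_event(struct model *m, struct item *it)
{
    if (m->ovflist != UNACTIVE) {
        /* A transfer is in progress: chain the item once on ovflist. */
        if (it->next == UNACTIVE) {
            it->next = m->ovflist;
            m->ovflist = it;
        }
        return;
    }
    ready_push(m, it);
}

/* Models the start of ep_send_events: steal the ready list and
 * activate ovflist. Returns how many items were stolen. */
size_t transfer_begin(struct model *m)
{
    size_t stolen = m->nready;

    for (size_t i = 0; i < m->nready; i++)
        m->ready[i]->on_ready = 0;
    m->nready = 0;
    m->ovflist = NULL;
    return stolen;
}

/* Models the end of ep_send_events: drain ovflist into the ready
 * list and deactivate it again. */
void transfer_end(struct model *m)
{
    struct item *it = m->ovflist, *next;

    while (it != NULL) {
        next = it->next;
        it->next = UNACTIVE;
        ready_push(m, it);
        it = next;
    }
    m->ovflist = UNACTIVE;
}
```

An event delivered between transfer_begin and transfer_end is therefore not lost and not duplicated: it waits on the overflow chain and reaches the ready list when the transfer ends.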

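From the source code we can see the main advantages of epoll:

  1. The information passed in by the user is kept in the kernel, so it does not have to be passed in on every call.
  2. Event detection no longer scans the whole monitored set. Instead, each monitored socket notifies epoll asynchronously through its wait callback when an event arrives, and the result is then returned to the user.

The synchronization scheme in epoll is also a classic piece of kernel design and is worth studying in depth.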

