research
▸ Networking for AI
In-network computing and network-level optimization of parallelism strategies for distributed machine learning.
A growing line of my work studies how to design networking mechanisms tailored to the needs of AI workloads. This includes leveraging programmable network elements to perform computation inside the network itself, rather than only at the endpoints, as well as optimizing the network-level parallelism strategies used in distributed machine learning training. This raises new combinatorial optimization problems: how to place and schedule in-network computation, and how to coordinate communication between training workers, so as to minimize network utilization and congestion subject to bounded resources, while providing provable performance guarantees.2025
- NAICT3P: Topology-Tailored Tensor ParallelismS. Ben-Yohana, C. Avin, and G. ScalosubProceedings of the 2nd Workshop on Networks for AI Computing (NAIC), 2025
As deep learning models continue to grow in scale and complexity, methods of distributed machine learning training, and particularly those used for large language models (LLMs), have become a critical ingredient in making such computations efficient and feasible. In such contexts, tensor parallelism (TP) is widely employed to distribute computations across multiple accelerators. However, since TP mandates frequent and high-volume communication between devices, the underlying network characteristics significantly influence performance. Previous work was mostly either model-agnostic or topology-agnostic and did not pick provably optimal configurations. This study presents Topology-Tailored Tensor Parallelism, T3P, an efficient algorithm that identifies the communication-optimal TP sharding configuration (within the considered search space) based on both the model architecture and the network topology. In particular, we show that T3P is optimal for any given resharding cost model.
@inproceedings{benyohana2025tttp, bibtex_show = {true}, abbr = {NAIC}, themes = {networking_ai}, author = {Ben-Yohana, S. and Avin, C. and Scalosub, G.}, title = {{T3P}: Topology-Tailored Tensor Parallelism}, booktitle = {Proceedings of the 2nd Workshop on Networks for AI Computing (NAIC)}, pages = {81--88}, year = {2025}, abstract = {As deep learning models continue to grow in scale and complexity, methods of distributed machine learning training, and particularly those used for large language models (LLMs), have become a critical ingredient in making such computations efficient and feasible. In such contexts, tensor parallelism (TP) is widely employed to distribute computations across multiple accelerators. However, since TP mandates frequent and high-volume communication between devices, the underlying network characteristics significantly influence performance. Previous work was mostly either model-agnostic or topology-agnostic and did not pick provably optimal configurations. This study presents Topology-Tailored Tensor Parallelism, T3P, an efficient algorithm that identifies the communication-optimal TP sharding configuration (within the considered search space) based on both the model architecture and the network topology. In particular, we show that T3P is optimal for any given resharding cost model.}, selected = {true} }
2024
- TNSMSOAR: minimizing network utilization with bounded in-network computingR. Segal, C. Avin, and G. ScalosubIEEE Transactions on Network and Service Management, 2024
In-network computing via smart networking devices is a recent trend in modern datacenter networks. State-of-the-art switches with near line-rate computing and aggregation capabilities enable acceleration and improved resource utilization for modern applications like large-scale distributed and federated machine learning, as well as big data analytics. We study the problem of activating a \em limited number of in-network computing devices within a network, aiming at reducing the overall cost incurred by such a deployment. Such limitations on the number of in-network computing elements arise, e.g., in incremental upgrades of network infrastructure, and are also due to requiring specialized middleboxes, or FPGAs, for supporting heterogeneous workloads, and multiple tenants. We present an efficient optimal algorithm for placing such devices in tree networks with arbitrary link rates, and further evaluate its performance in various scenarios and for various tasks, including federated/distributed ML and big data analytics. Our results show that even a small fraction of network devices supporting in-network aggregation leads to a significant reduction in network utilization cost. Furthermore, we show that various intuitive strategies for performing such placements are significantly inferior compared with our solution, for varying workloads, tasks, and link rates.
@article{segal2024soar, bibtex_show = {true}, abbr = {TNSM}, themes = {networking_ai}, title = {SOAR: minimizing network utilization with bounded in-network computing}, author = {Segal, R. and Avin, C. and Scalosub, G.}, journal = {IEEE Transactions on Network and Service Management}, volume = {21}, number = {2}, pages = {1832--1851}, year = {2024}, abstract = {In-network computing via smart networking devices is a recent trend in modern datacenter networks. State-of-the-art switches with near line-rate computing and aggregation capabilities enable acceleration and improved resource utilization for modern applications like large-scale distributed and federated machine learning, as well as big data analytics. We study the problem of activating a {\em limited number} of in-network computing devices within a network, aiming at reducing the overall cost incurred by such a deployment. Such limitations on the number of in-network computing elements arise, e.g., in incremental upgrades of network infrastructure, and are also due to requiring specialized middleboxes, or FPGAs, for supporting heterogeneous workloads, and multiple tenants. We present an efficient optimal algorithm for placing such devices in tree networks with arbitrary link rates, and further evaluate its performance in various scenarios and for various tasks, including federated/distributed ML and big data analytics. Our results show that even a small fraction of network devices supporting in-network aggregation leads to a significant reduction in network utilization cost. Furthermore, we show that various intuitive strategies for performing such placements are significantly inferior compared with our solution, for varying workloads, tasks, and link rates.}, selected = {true} }
2022
- INFOCOMConstrained In-network Computing with Low Congestion in Datacenter NetworksR. Segal, C. Avin, and G. ScalosubProceedings of IEEE INFOCOM 2022, the 41st Annual Joint Conference of the IEEE Computer and Communications Societies, 2022
Distributed computing has become a common practice nowadays, where recent focus has been given to the usage of smart networking devices with in-network computing capabilities. State-of-the-art switches with near-line rate computing and aggregation capabilities enable acceleration and improved performance for various modern applications like big data analytics and large-scale distributed and federated machine learning. In this work, we formulate and study the theoretical algorithmic foundations of such approaches, and focus on how to deploy and use constrained in-network computing capabilities within the data center. We focus our attention on reducing the network congestion, i.e., the most congested link in the network, while supporting the given workload(s). We present an efficient optimal algorithm for tree-like network topologies and show that our solution provides as much as an x13 improvement over common alternative approaches. In particular, our results show that having merely a small fraction of network devices that support in-network aggregation can significantly reduce the network congestion, both for single and multiple workloads.
@inproceedings{segal2022constrained, bibtex_show = {true}, abbr = {INFOCOM}, themes = {networking_ai}, title = {Constrained In-network Computing with Low Congestion in Datacenter Networks}, author = {Segal, R. and Avin, C. and Scalosub, G.}, booktitle = {Proceedings of IEEE INFOCOM 2022, the 41st Annual Joint Conference of the IEEE Computer and Communications Societies}, pages = {1639--1648}, year = {2022}, abstract = {Distributed computing has become a common practice nowadays, where recent focus has been given to the usage of smart networking devices with in-network computing capabilities. State-of-the-art switches with near-line rate computing and aggregation capabilities enable acceleration and improved performance for various modern applications like big data analytics and large-scale distributed and federated machine learning. In this work, we formulate and study the theoretical algorithmic foundations of such approaches, and focus on how to deploy and use constrained in-network computing capabilities within the data center. We focus our attention on reducing the network congestion, i.e., the most congested link in the network, while supporting the given workload(s). We present an efficient optimal algorithm for tree-like network topologies and show that our solution provides as much as an x13 improvement over common alternative approaches. In particular, our results show that having merely a small fraction of network devices that support in-network aggregation can significantly reduce the network congestion, both for single and multiple workloads.}, selected = {true} }
2021
- CoNEXTSOAR: minimizing network utilization with bounded in-network computingR. Segal, C. Avin, and G. ScalosubProceedings of CoNEXT 2021, the 17th International Conference on emerging Networking EXperiments and Technologies, 2021
In-network computing via smart networking devices is a recent trend for modern datacenter networks. State-of-the-art switches with near line rate computing and aggregation capabilities are developed to enable, e.g., acceleration and better utilization for modern applications like big data analytics, and large scale distributed and federated machine learning. We formulate and study the problem of activating a limited number of in-network computing devices within a network, aiming at reducing the overall network utilization for a given workload. Such limitations on the number of in-network computing elements per workload arise, e.g., in incremental upgrades of network infrastructure, and are also due to requiring specialized middleboxes, or FPGAs, that should support heterogeneous workloads, and multiple tenants. We present an optimal and efficient algorithm for placing such devices in tree networks with arbitrary link rates, and further evaluate our proposed solution in various scenarios and for various tasks. Our results show that having merely a small fraction of network devices support in-network aggregation can lead to a significant reduction in network utilization. Furthermore, we show that various intuitive strategies for performing such placements exhibit significantly inferior performance compared to our solution, for varying workloads, tasks, and link rates.
@inproceedings{segal2021soar, bibtex_show = {true}, abbr = {CoNEXT}, themes = {networking_ai}, title = {SOAR: minimizing network utilization with bounded in-network computing}, author = {Segal, R. and Avin, C. and Scalosub, G.}, booktitle = {Proceedings of CoNEXT 2021, the 17th International Conference on emerging Networking EXperiments and Technologies}, pages = {16--29}, abstract = {In-network computing via smart networking devices is a recent trend for modern datacenter networks. State-of-the-art switches with near line rate computing and aggregation capabilities are developed to enable, e.g., acceleration and better utilization for modern applications like big data analytics, and large scale distributed and federated machine learning. We formulate and study the problem of activating a limited number of in-network computing devices within a network, aiming at reducing the overall network utilization for a given workload. Such limitations on the number of in-network computing elements per workload arise, e.g., in incremental upgrades of network infrastructure, and are also due to requiring specialized middleboxes, or FPGAs, that should support heterogeneous workloads, and multiple tenants. We present an optimal and efficient algorithm for placing such devices in tree networks with arbitrary link rates, and further evaluate our proposed solution in various scenarios and for various tasks. Our results show that having merely a small fraction of network devices support in-network aggregation can lead to a significant reduction in network utilization. Furthermore, we show that various intuitive strategies for performing such placements exhibit significantly inferior performance compared to our solution, for varying workloads, tasks, and link rates.}, year = {2021} }
▸ Caching
Algorithms and competitive analysis for network caching under limited or imprecise knowledge.
One of my longest-running research threads: designing and analyzing algorithms for network caching, including LPM caching for forwarding tables, indicator-based caching with bandwidth constraints and false-negative awareness, latency-aware caching with delayed hits, and access strategies under imprecise knowledge. Much of this work provides worst-case (competitive ratio) performance guarantees alongside practical evaluation.2026
- NSDILatency-Aware Caching with Delayed Hits: From Bursty Traffic to Pipeline ArchitecturesN. Keren, G. Einziger, and G. ScalosubProceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2026
Modern computing systems rely on caching to reduce access latency and optimize resource utilization. However, in heterogeneous storage and cloud environments, non-uniform access latencies across storage tiers, network locations, and intermediary caches undermine traditional caching. Moreover, modern cache algorithms that attempt to capture multiple access patterns, recency, frequency, and burstiness, often become complex and difficult to maintain. As a key contribution, we propose an adaptive caching architecture that treats caching strategies as a pipeline of simple, orthogonal policies, each focused on a distinct access bias. This modular design is easier to expand, debug, and integrate, and it self-adjusts the memory resources allocated to each stage to optimize overall workload performance. New heuristics can be introduced dynamically without disrupting existing behaviors. In addition, in latency-aware caching, one often encounters the phenomenon of delayed hits, where items not yet available in the cache are requested repeatedly. We introduce the Least Bursty Used (LBU) heuristic, which retains items exhibiting high burstiness even when they are neither recent nor frequent, thereby mitigating delayed hits that degrade request latency. We embed LBU within our pipeline and derive the Recency-Frequency-Burstiness (RFB) policy, which balances resources among recency, frequency, and burstiness. Evaluations on thirteen real-world storage traces from IBM, Twitter and Meta using latencies drawn from real-life deployments show that RFB reduces average request latency by 10\% compared to the best state-of-the-art alternative, while maintaining consistent performance, with a low standard deviation across bursty and non-bursty workloads.
@inproceedings{keren2026latency, bibtex_show = {true}, abbr = {NSDI}, themes = {caching}, title = {Latency-Aware Caching with Delayed Hits: From Bursty Traffic to Pipeline Architectures}, author = {Keren, N. and Einziger, G. and Scalosub, G.}, booktitle = {Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI)}, pages = {2389--2405}, year = {2026}, abstract = {Modern computing systems rely on caching to reduce access latency and optimize resource utilization. However, in heterogeneous storage and cloud environments, non-uniform access latencies across storage tiers, network locations, and intermediary caches undermine traditional caching. Moreover, modern cache algorithms that attempt to capture multiple access patterns, recency, frequency, and burstiness, often become complex and difficult to maintain. As a key contribution, we propose an adaptive caching architecture that treats caching strategies as a pipeline of simple, orthogonal policies, each focused on a distinct access bias. This modular design is easier to expand, debug, and integrate, and it self-adjusts the memory resources allocated to each stage to optimize overall workload performance. New heuristics can be introduced dynamically without disrupting existing behaviors. In addition, in latency-aware caching, one often encounters the phenomenon of delayed hits, where items not yet available in the cache are requested repeatedly. We introduce the Least Bursty Used (LBU) heuristic, which retains items exhibiting high burstiness even when they are neither recent nor frequent, thereby mitigating delayed hits that degrade request latency. We embed LBU within our pipeline and derive the Recency-Frequency-Burstiness (RFB) policy, which balances resources among recency, frequency, and burstiness. Evaluations on thirteen real-world storage traces from IBM, Twitter and Meta using latencies drawn from real-life deployments show that RFB reduces average request latency by 10\% compared to the best state-of-the-art alternative, while maintaining consistent performance, with a low standard deviation across bursty and non-bursty workloads.}, selected = {true} }
2023
- SIGMETRICSGo-to-Controller is Better: Efficient and Optimal LPM Caching with SplicingI. Gozlan, C. Avin, G. Einziger, and G. ScalosubProceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2023
Data center networks must support huge forwarding policies as they handle the traffic of the various tenants. Since such policies cannot be stored within the limited memory available at commodity switches, SDN controllers can manage the memory available at the switch as a cache, updating and changing the forwarding rules in the cache according to the policy and workloads dynamics. Most policies, such as Longest-prefix-match (LPM) policies, include dependencies between the forwarding rules, which introduce consistency constraints on the structure of the cached content, affecting the performance in terms of throughput and delay. Previous work suggested the concept of splicing to address such deficiencies, where modified Go-to-Controller rules can be inserted into the cache to improve performance while maintaining consistency. We present the first optimal algorithm for determining the cache content with splicing, as well as several efficient heuristics with some performance guarantees. We evaluate our solutions using traces derived from real systems and traffic, and show that splicing can reduce the cache miss ratio by as much as 30%, without increasing the cache size. We further propose a new metric which can provide a quick estimate as to the potential benefits of splicing compared to classical LPM-caching.
@article{gozlan2023goto, bibtex_show = {true}, abbr = {SIGMETRICS}, themes = {caching}, title = {Go-to-Controller is Better: Efficient and Optimal LPM Caching with Splicing}, author = {Gozlan, I. and Avin, C. and Einziger, G. and Scalosub, G.}, journal = {Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS)}, volume = {7}, number = {1}, pages = {10:1--10:33}, year = {2023}, abstract = {Data center networks must support huge forwarding policies as they handle the traffic of the various tenants. Since such policies cannot be stored within the limited memory available at commodity switches, SDN controllers can manage the memory available at the switch as a cache, updating and changing the forwarding rules in the cache according to the policy and workloads dynamics. Most policies, such as Longest-prefix-match (LPM) policies, include dependencies between the forwarding rules, which introduce consistency constraints on the structure of the cached content, affecting the performance in terms of throughput and delay. Previous work suggested the concept of splicing to address such deficiencies, where modified Go-to-Controller rules can be inserted into the cache to improve performance while maintaining consistency. We present the first optimal algorithm for determining the cache content with splicing, as well as several efficient heuristics with some performance guarantees. We evaluate our solutions using traces derived from real systems and traffic, and show that splicing can reduce the cache miss ratio by as much as 30%, without increasing the cache size. We further propose a new metric which can provide a quick estimate as to the potential benefits of splicing compared to classical LPM-caching.}, selected = {true} } - SYSTOROn Latency Awareness with Delayed Hits (Poster)N. Keren, G. Einziger, and G. ScalosubProceedings of the 16th ACM International Conference on Systems and Storage (SYSTOR), 2023
We consider a new locality pattern in the form of burstiness to improve cache effectiveness in workflows where items are requested in possibly infrequent yet costly batches. Adding a cache that handles only bursty items to existing State-Of-The-Art algorithms shows a significant improvement in overall average time per query.
@inproceedings{keren2023latency, bibtex_show = {true}, abbr = {SYSTOR}, themes = {caching}, title = {On Latency Awareness with Delayed Hits (Poster)}, author = {Keren, N. and Einziger, G. and Scalosub, G.}, booktitle = {Proceedings of the 16th ACM International Conference on Systems and Storage (SYSTOR)}, pages = {144}, abstract = {We consider a new locality pattern in the form of burstiness to improve cache effectiveness in workflows where items are requested in possibly infrequent yet costly batches. Adding a cache that handles only bursty items to existing State-Of-The-Art algorithms shows a significant improvement in overall average time per query.}, year = {2023} }
2022
- ToNFalse Negative Awareness in Indicator-Based Caching SystemsI. Cohen, G. Einziger, and G. ScalosubIEEE/ACM Transactions on Networking, 2022
Distributed caching systems such as content distribution networks often advertise their content via lightweight approximate indicators (e.g., Bloom filters) to efficiently inform clients where each datum is likely cached. While false-positive indications are necessary and well understood, most existing works assume no false-negative indications. Our work illustrates practical scenarios where false-negatives are unavoidable and ignoring them significantly impacts system performance. Specifically, we focus on false-negatives induced by indicator staleness, which arises whenever the system advertises the indicator only periodically, rather than immediately reporting every change in the cache. Such scenarios naturally occur, e.g., in bandwidth-constraint environments or when latency impedes each client's ability to obtain an updated indicator. Our work introduces novel false-negative aware access policies that continuously estimate the false-negative ratio and sometimes access caches despite negative indications. We present optimal policies for homogeneous settings and provide approximation guarantees for our algorithms in heterogeneous environments. We further perform an extensive simulation study with multiple real system traces. We show that our false-negative aware algorithms incur a significantly lower service cost than existing approaches or match the cost of these approaches while requiring an order of magnitude fewer resources (e.g., caching capacity or bandwidth).
@article{cohen2022falsenegative, bibtex_show = {true}, abbr = {ToN}, themes = {caching}, title = {False Negative Awareness in Indicator-Based Caching Systems}, author = {Cohen, I. and Einziger, G. and Scalosub, G.}, journal = {IEEE/ACM Transactions on Networking}, volume = {30}, number = {6}, pages = {2674--2687}, year = {2022}, abstract = {Distributed caching systems such as content distribution networks often advertise their content via lightweight approximate indicators (e.g., Bloom filters) to efficiently inform clients where each datum is likely cached. While false-positive indications are necessary and well understood, most existing works assume no false-negative indications. Our work illustrates practical scenarios where false-negatives are unavoidable and ignoring them significantly impacts system performance. Specifically, we focus on false-negatives induced by indicator staleness, which arises whenever the system advertises the indicator only periodically, rather than immediately reporting every change in the cache. Such scenarios naturally occur, e.g., in bandwidth-constraint environments or when latency impedes each client's ability to obtain an updated indicator. Our work introduces novel false-negative aware access policies that continuously estimate the false-negative ratio and sometimes access caches despite negative indications. We present optimal policies for homogeneous settings and provide approximation guarantees for our algorithms in heterogeneous environments. We further perform an extensive simulation study with multiple real system traces. We show that our false-negative aware algorithms incur a significantly lower service cost than existing approaches or match the cost of these approaches while requiring an order of magnitude fewer resources (e.g., caching capacity or bandwidth).}, selected = {true} }
2021
- ToNAccess Strategies for Network CachingI. Cohen, G. Einziger, R. Friedman, and G. ScalosubIEEE/ACM Transactions on Networking, 2021
Having multiple data stores that can potentially serve content is common in modern networked applications. Data stores often publish approximate summaries of their content to enable effective utilization. Since these summaries are not entirely accurate, forming an efficient access strategy to multiple data stores becomes a complex risk management problem. This paper formally models this problem as a cost minimization problem, while taking into account both access costs, the inaccuracy of the approximate summaries, as well as the penalties incurred by failed requests. We introduce practical algorithms with guaranteed approximation ratios and further show that they are optimal in various settings. We also perform an extensive simulation study based on real data and show that our algorithms are more robust than existing heuristics. That is, they exhibit near-optimal performance in various settings, whereas the efficiency of existing approaches depends upon system parameters that may change over time, or be otherwise unknown.
@article{cohen2021access, bibtex_show = {true}, abbr = {ToN}, themes = {caching}, title = {Access Strategies for Network Caching}, author = {Cohen, I. and Einziger, G. and Friedman, R. and Scalosub, G.}, journal = {IEEE/ACM Transactions on Networking}, volume = {29}, number = {2}, pages = {609--622}, abstract = {Having multiple data stores that can potentially serve content is common in modern networked applications. Data stores often publish approximate summaries of their content to enable effective utilization. Since these summaries are not entirely accurate, forming an efficient access strategy to multiple data stores becomes a complex risk management problem. This paper formally models this problem as a cost minimization problem, while taking into account both access costs, the inaccuracy of the approximate summaries, as well as the penalties incurred by failed requests. We introduce practical algorithms with guaranteed approximation ratios and further show that they are optimal in various settings. We also perform an extensive simulation study based on real data and show that our algorithms are more robust than existing heuristics. That is, they exhibit near-optimal performance in various settings, whereas the efficiency of existing approaches depends upon system parameters that may change over time, or be otherwise unknown.}, year = {2021} } - ICDCSOn the Power of False Negative Awareness in Indicator-based Caching SystemsI. Cohen, G. Einziger, and G. ScalosubProceedings of ICDCS 2021, the 41st IEEE International Conference on Distributed Computing Systems, 2021
Distributed caching systems such as content distribution networks often advertise their content via lightweight approximate indicators (e.g., Bloom filters) to efficiently inform clients where each datum is likely cached. While false-positive indications are necessary and well understood, most existing works assume no false-negative indications. Our work illustrates practical scenarios where false-negatives are unavoidable and ignoring them has a significant impact on system performance. Specifically, we focus on false-negatives induced by indicator staleness, which arises whenever the system advertises the indicator only periodically, rather than immediately reporting every change in the cache. Such scenarios naturally occur, e.g., in bandwidth-constraint environments or when latency impedes each client's ability to obtain an updated indicator. Our work introduces novel false-negative aware access policies that continuously estimate the false-negative ratio and sometimes access caches despite negative indications. We present optimal policies for homogeneous settings and provide approximation guarantees for our algorithms in heterogeneous environments. We further perform an extensive simulation study with multiple real system traces. We show that our false-negative aware algorithms incur a significantly lower access cost than existing approaches or match the cost of these approaches while requiring an order of magnitude fewer resources (e.g., caching capacity or bandwidth).
@inproceedings{cohen2021power, bibtex_show = {true}, abbr = {ICDCS}, themes = {caching}, title = {On the Power of False Negative Awareness in Indicator-based Caching Systems}, author = {Cohen, I. and Einziger, G. and Scalosub, G.}, booktitle = {Proceedings of ICDCS 2021, the 41st IEEE International Conference on Distributed Computing Systems}, pages = {46--56}, abstract = {Distributed caching systems such as content distribution networks often advertise their content via lightweight approximate indicators (e.g., Bloom filters) to efficiently inform clients where each datum is likely cached. While false-positive indications are necessary and well understood, most existing works assume no false-negative indications. Our work illustrates practical scenarios where false-negatives are unavoidable and ignoring them has a significant impact on system performance. Specifically, we focus on false-negatives induced by indicator staleness, which arises whenever the system advertises the indicator only periodically, rather than immediately reporting every change in the cache. Such scenarios naturally occur, e.g., in bandwidth-constraint environments or when latency impedes each client's ability to obtain an updated indicator. Our work introduces novel false-negative aware access policies that continuously estimate the false-negative ratio and sometimes access caches despite negative indications. We present optimal policies for homogeneous settings and provide approximation guarantees for our algorithms in heterogeneous environments. We further perform an extensive simulation study with multiple real system traces. We show that our false-negative aware algorithms incur a significantly lower access cost than existing approaches or match the cost of these approaches while requiring an order of magnitude fewer resources (e.g., caching capacity or bandwidth).}, year = {2021} } - INFOCOMSelf-adjusting Advertisement of Cache Indicators with Bandwidth ConstraintsI. Cohen, G. Einziger, and G. ScalosubProceedings of IEEE INFOCOM 2021, the 40th Annual Joint Conference of the IEEE Computer and Communications Societies, 2021
Cache advertisements reduce the access cost by allowing users to skip the cache when it does not contain their datum. Such advertisements are used in multiple networked domains such as 5G networks, wide area networks, and information-centric networking. The selection of an advertisement strategy exposes a trade-off between the access cost and bandwidth consumption. Still, existing works mostly apply a trial-and-error approach for selecting the best strategy, as the rigorous foundations required for optimizing such decisions is lacking.Our work shows that the desired advertisement policy depends on numerous parameters such as the cache policy, the workload, the cache size, and the available bandwidth. In particular, we show that there is no ideal single configuration. Therefore, we design an adaptive, self-adjusting algorithm that periodically selects an advertisement policy. Our algorithm does not require any prior information about the cache policy, cache size, or work-load, and does not require any apriori configuration. Through extensive simulations, using several state-of-the-art cache policies, and real workloads, we show that our approach attains a similar cost to that of the best static configuration (which is only identified in retrospect) in each case.
@inproceedings{cohen2021selfadjusting, bibtex_show = {true}, abbr = {INFOCOM}, themes = {caching}, title = {Self-adjusting Advertisement of Cache Indicators with Bandwidth Constraints}, author = {Cohen, I. and Einziger, G. and Scalosub, G.}, booktitle = {Proceedings of IEEE INFOCOM 2021, the 40th Annual Joint Conference of the IEEE Computer and Communications Societies}, year = {2021}, abstract = {Cache advertisements reduce the access cost by allowing users to skip the cache when it does not contain their datum. Such advertisements are used in multiple networked domains such as 5G networks, wide area networks, and information-centric networking. The selection of an advertisement strategy exposes a trade-off between the access cost and bandwidth consumption. Still, existing works mostly apply a trial-and-error approach for selecting the best strategy, as the rigorous foundations required for optimizing such decisions is lacking.Our work shows that the desired advertisement policy depends on numerous parameters such as the cache policy, the workload, the cache size, and the available bandwidth. In particular, we show that there is no ideal single configuration. Therefore, we design an adaptive, self-adjusting algorithm that periodically selects an advertisement policy. Our algorithm does not require any prior information about the cache policy, cache size, or work-load, and does not require any apriori configuration. Through extensive simulations, using several state-of-the-art cache policies, and real workloads, we show that our approach attains a similar cost to that of the best static configuration (which is only identified in retrospect) in each case.}, additional_info = {Also appeared in the Highlights Track in SYSTOR 2021.} }
2019
- INFOCOMAccess Strategies for Network CachingI. Cohen, G. Einziger, R. Friedman, and G. ScalosubProceedings of IEEE INFOCOM 2019, the 38th Annual Joint Conference of the IEEE Computer and Communications Societies, 2019
Having multiple data stores that can potentially serve content is common in modern networked applications. Data stores often publish approximate summaries of their content to enable effective utilization. Since these summaries are not entirely accurate, forming an efficient access strategy to multiple data stores becomes a complex risk management problem. This paper formally models this problem, and introduces practical algorithms with guaranteed approximation ratios, and in particular we show that our algorithms are optimal in a variety of settings. We also perform an extensive simulation study based on real data, and show that our algorithms are more robust than existing heuristics. That is, they exhibit near optimal performance in various settings whereas the efficiency of existing approaches depends upon system parameters that may change over time, or be otherwise unknown.
@inproceedings{cohen2019access, bibtex_show = {true}, abbr = {INFOCOM}, themes = {caching}, title = {Access Strategies for Network Caching}, author = {Cohen, I. and Einziger, G. and Friedman, R. and Scalosub, G.}, booktitle = {Proceedings of IEEE INFOCOM 2019, the 38th Annual Joint Conference of the IEEE Computer and Communications Societies}, pages = {28--36}, year = {2019}, pdf = {publications/cohen19access.pdf}, abstract = {Having multiple data stores that can potentially serve content is common in modern networked applications. Data stores often publish approximate summaries of their content to enable effective utilization. Since these summaries are not entirely accurate, forming an efficient access strategy to multiple data stores becomes a complex risk management problem. This paper formally models this problem, and introduces practical algorithms with guaranteed approximation ratios, and in particular we show that our algorithms are optimal in a variety of settings. We also perform an extensive simulation study based on real data, and show that our algorithms are more robust than existing heuristics. That is, they exhibit near optimal performance in various settings whereas the efficiency of existing approaches depends upon system parameters that may change over time, or be otherwise unknown.} }
▸ Buffer Management & Scheduling
Competitive analysis of buffer management and scheduling algorithms under limited knowledge and resource constraints.
One of my longest-running research threads: designing and analyzing algorithms for buffer management and scheduling under limited or imprecise knowledge, often with inter-packet dependencies. Topics include scheduling with bounded delay, buffer management for aggregated streaming data, queueing with limited knowledge, performance guarantees for multipass network processors, and competitive analysis of buffer policies with SLA commitments. Much of this work provides worst-case (competitive ratio) performance guarantees alongside practical evaluation.2022
- CompCommunBounded Delay Scheduling with Packet DependenciesM. Markovitch, and G. ScalosubComputer Communications, 2022
A common situation occurring when dealing with multimedia traffic is having large data frames fragmented into smaller IP packets, and having these packets sent independently through the network. For real-time multimedia traffic, dropping even few packets of a frame may render the entire frame useless. Such traffic is usually modeled as having inter-packet dependencies. We study the problem of scheduling traffic with such dependencies, where each packet has a deadline by which it should arrive at its destination. Such deadlines are common for real-time multimedia applications, and are derived from stringent delay constraints posed by the application. The figure of merit in such environments is maximizing the system's goodput, namely, the number of frames successfully delivered. We study online algorithms for the problem of maximizing goodput of delay-bounded traffic with inter-packet dependencies, and use competitive analysis to evaluate their performance. We present competitive algorithms for the problem, as well as matching lower bounds that are tight up to a constant factor. We further present the results of a simulation study which further validates our algorithmic approach and shows that insights arising from our analysis are indeed manifested in practice.
@article{markovitch2022bounded, bibtex_show = {true}, abbr = {CompCommun}, themes = {buffer_scheduling}, title = {Bounded Delay Scheduling with Packet Dependencies}, author = {Markovitch, M. and Scalosub, G.}, journal = {Computer Communications}, volume = {182}, pages = {98--109}, abstract = {A common situation occurring when dealing with multimedia traffic is having large data frames fragmented into smaller IP packets, and having these packets sent independently through the network. For real-time multimedia traffic, dropping even few packets of a frame may render the entire frame useless. Such traffic is usually modeled as having inter-packet dependencies. We study the problem of scheduling traffic with such dependencies, where each packet has a deadline by which it should arrive at its destination. Such deadlines are common for real-time multimedia applications, and are derived from stringent delay constraints posed by the application. The figure of merit in such environments is maximizing the system's goodput, namely, the number of frames successfully delivered. We study online algorithms for the problem of maximizing goodput of delay-bounded traffic with inter-packet dependencies, and use competitive analysis to evaluate their performance. We present competitive algorithms for the problem, as well as matching lower bounds that are tight up to a constant factor. We further present the results of a simulation study which further validates our algorithmic approach and shows that insights arising from our analysis are indeed manifested in practice.}, year = {2022} }
2018
- CompNetQueueing in the Mist: Buffering and Scheduling with Limited KnowledgeI. Cohen, and G. ScalosubComputer Networks, 2018
Scheduling and managing queues with bounded buffers are among the most fundamental problems in computer networking. Traditionally, it is often assumed that all the properties of each packet are known immediately upon arrival. However, as traffic becomes increasingly heterogeneous and complex, such assumptions are in many cases invalid. In particular, in various scenarios information about packet characteristics becomes available only after the packet has undergone some initial processing. In this work, we study the problem of managing queues with limited knowledge. We start by showing lower bounds on the competitive ratio of any algorithm in such settings. The techniques used in our proofs, which make use of a carefully crafted Markov process, may be of independent interest, and can potentially be used in other similar settings as well. Next, we use the insight obtained from these bounds to identify several algorithmic concepts appropriate for the problem, and use these guidelines to design a concrete algorithmic framework. We analyze the performance of our proposed algorithm, and further show how it can be implemented in various settings, which differ by the type and nature of the unknown information. We further validate our results and algorithmic approach by an extensive simulation study that provides further insights as to our algorithmic design principles in face of limited knowledge.
@article{cohen2018queueing, bibtex_show = {true}, abbr = {CompNet}, themes = {buffer_scheduling}, title = {Queueing in the Mist: Buffering and Scheduling with Limited Knowledge}, author = {Cohen, I. and Scalosub, G.}, journal = {Computer Networks}, volume = {147}, pages = {204--220}, abstract = {Scheduling and managing queues with bounded buffers are among the most fundamental problems in computer networking. Traditionally, it is often assumed that all the properties of each packet are known immediately upon arrival. However, as traffic becomes increasingly heterogeneous and complex, such assumptions are in many cases invalid. In particular, in various scenarios information about packet characteristics becomes available only after the packet has undergone some initial processing. In this work, we study the problem of managing queues with limited knowledge. We start by showing lower bounds on the competitive ratio of any algorithm in such settings. The techniques used in our proofs, which make use of a carefully crafted Markov process, may be of independent interest, and can potentially be used in other similar settings as well. Next, we use the insight obtained from these bounds to identify several algorithmic concepts appropriate for the problem, and use these guidelines to design a concrete algorithmic framework. We analyze the performance of our proposed algorithm, and further show how it can be implemented in various settings, which differ by the type and nature of the unknown information. We further validate our results and algorithmic approach by an extensive simulation study that provides further insights as to our algorithmic design principles in face of limited knowledge.}, year = {2018} }
2016
- JNCALarge Profits or Fast Gains: A Dilemma in Maximizing Throughput with Applications to Network ProcessorsK. Kogan, A. Lopez-Ortiz, S. Nikolenko, G. Scalosub, and M. SegalJournal of Network and Computer Applications, 2016
We consider the fundamental problem of managing a bounded size queue buffer where traffic consists of packets of varying size, each packet requires several rounds of processing before it can be transmitted out, and the goal is to maximize the throughput, i.e., total size of successfully transmitted packets. Our work addresses the tension between two conflicting algorithmic approaches: favoring packets with fewer processing requirements as opposed to packets of larger size. We present a novel model for studying such systems and study the performance of online algorithms that aim to maximize throughput.
@article{kogan2016large, bibtex_show = {true}, abbr = {JNCA}, themes = {buffer_scheduling}, title = {Large Profits or Fast Gains: A Dilemma in Maximizing Throughput with Applications to Network Processors}, author = {Kogan, K. and Lopez-Ortiz, A. and Nikolenko, S. and Scalosub, G. and Segal, M.}, journal = {Journal of Network and Computer Applications}, volume = {74}, pages = {31--43}, abstract = {We consider the fundamental problem of managing a bounded size queue buffer where traffic consists of packets of varying size, each packet requires several rounds of processing before it can be transmitted out, and the goal is to maximize the throughput, i.e., total size of successfully transmitted packets. Our work addresses the tension between two conflicting algorithmic approaches: favoring packets with fewer processing requirements as opposed to packets of larger size. We present a novel model for studying such systems and study the performance of online algorithms that aim to maximize throughput.}, year = {2016} }
2013
- TCSCompetitive Buffer Management with Packet DependenciesA. Kesselman, B. Patt-Shamir, and G. ScalosubTheoretical Computer Science, 2013
We introduce the problem of managing a FIFO buffer of bounded space, where arriving packets have dependencies among them. Our model is motivated by the scenario where large data frames must be split into multiple packets, because maximum packet size is limited by data-link restrictions. A frame is considered useful only if sufficiently many of its constituent packets are delivered. The buffer management algorithm decides, in case of overflow, which packets to discard and which to keep in the buffer. The goal of the buffer management algorithm is to maximize throughput of useful frames. This problem has a variety of applications, e.g., Internet video streaming, where video frames are segmented and encapsulated in IP packets sent over the Internet. We study the complexity of the above problem in both the offline and online settings. We give upper and lower bounds on the performance of algorithms using competitive analysis.
@article{kesselman2013competitive, bibtex_show = {true}, abbr = {TCS}, themes = {buffer_scheduling}, title = {Competitive Buffer Management with Packet Dependencies}, author = {Kesselman, A. and Patt-Shamir, B. and Scalosub, G.}, journal = {Theoretical Computer Science}, volume = {489-490}, pages = {75--87}, abstract = {We introduce the problem of managing a FIFO buffer of bounded space, where arriving packets have dependencies among them. Our model is motivated by the scenario where large data frames must be split into multiple packets, because maximum packet size is limited by data-link restrictions. A frame is considered useful only if sufficiently many of its constituent packets are delivered. The buffer management algorithm decides, in case of overflow, which packets to discard and which to keep in the buffer. The goal of the buffer management algorithm is to maximize throughput of useful frames. This problem has a variety of applications, e.g., Internet video streaming, where video frames are segmented and encapsulated in IP packets sent over the Internet. We study the complexity of the above problem in both the offline and online settings. We give upper and lower bounds on the performance of algorithms using competitive analysis.}, year = {2013} } - TPDSBuffer Management for Aggregated Streaming Data with Packet DependenciesG. Scalosub, J. Liebeherr, and P. MarbachIEEE Transactions on Parallel and Distributed Systems, 2013
In many applications, the traffic traversing the network has interpacket dependencies due to application-level encoding schemes. For some applications, e.g., multimedia streaming, dropping a single packet may render useless the delivery of a whole sequence. In such environments, the algorithm used to decide which packet to drop in case of buffer overflows must be carefully designed, to avoid goodput degradation. We present a model that captures such interpacket dependencies, and design algorithms for performing packet discard. Traffic consists of an aggregation of multiple streams, each of which consists of a sequence of interdependent packets. We provide two guidelines for designing buffer management algorithms, and demonstrate their effectiveness. We devise an algorithm according to these guidelines and evaluate its performance analytically, using competitive analysis. We also perform a simulation study that shows that the performance of our algorithm is within a small fraction of the performance of the best known offline algorithm.
@article{scalosub2013buffer, bibtex_show = {true}, abbr = {TPDS}, themes = {buffer_scheduling}, title = {Buffer Management for Aggregated Streaming Data with Packet Dependencies}, author = {Scalosub, G. and Liebeherr, J. and Marbach, P.}, journal = {IEEE Transactions on Parallel and Distributed Systems}, volume = {24}, number = {3}, pages = {439--449}, abstract = {In many applications, the traffic traversing the network has interpacket dependencies due to application-level encoding schemes. For some applications, e.g., multimedia streaming, dropping a single packet may render useless the delivery of a whole sequence. In such environments, the algorithm used to decide which packet to drop in case of buffer overflows must be carefully designed, to avoid goodput degradation. We present a model that captures such interpacket dependencies, and design algorithms for performing packet discard. Traffic consists of an aggregation of multiple streams, each of which consists of a sequence of interdependent packets. We provide two guidelines for designing buffer management algorithms, and demonstrate their effectiveness. We devise an algorithm according to these guidelines and evaluate its performance analytically, using competitive analysis. We also perform a simulation study that shows that the performance of our algorithm is within a small fraction of the performance of the best known offline algorithm.}, year = {2013} }
2012
- ToNProviding Performance Guarantees in Multipass Network ProcessorsI. Keslassy, K. Kogan, G. Scalosub, and M. SegalIEEE/ACM Transactions on Networking, 2012
Current network processors (NPs) increasingly deal with packets with heterogeneous processing times. In such an environment, packets that require many processing cycles delay low-latency traffic because the common approach in today's NPs is to employ run-to-completion processing. These difficulties have led to the emergence of the Multipass NP architecture, where after a processing cycle ends, all processed packets are recycled into the buffer and recompete for processing resources. In this paper, we provide a model that captures many of the characteristics of this architecture, and we consider several scheduling and buffer management algorithms that are specially designed to optimize the performance of multipass network processors. In particular, we provide analytical guarantees for the throughput performance of our algorithms. We further conduct a comprehensive simulation study, which validates our results.
@article{keslassy2012providing, bibtex_show = {true}, abbr = {ToN}, themes = {buffer_scheduling}, title = {Providing Performance Guarantees in Multipass Network Processors}, author = {Keslassy, I. and Kogan, K. and Scalosub, G. and Segal, M.}, journal = {IEEE/ACM Transactions on Networking}, volume = {20}, number = {6}, pages = {1895--1909}, year = {2012}, abstract = {Current network processors (NPs) increasingly deal with packets with heterogeneous processing times. In such an environment, packets that require many processing cycles delay low-latency traffic because the common approach in today's NPs is to employ run-to-completion processing. These difficulties have led to the emergence of the Multipass NP architecture, where after a processing cycle ends, all processed packets are recycled into the buffer and recompete for processing resources. In this paper, we provide a model that captures many of the characteristics of this architecture, and we consider several scheduling and buffer management algorithms that are specially designed to optimize the performance of multipass network processors. In particular, we provide analytical guarantees for the throughput performance of our algorithms. We further conduct a comprehensive simulation study, which validates our results.}, selected = {true} }
2011
- ToARate vs. Buffer Size: Greedy Information Gathering on the LineA. Rosén, and G. ScalosubACM Transactions on Algorithms, 2011
We consider packet networks with limited buffer space at the nodes, and are interested in the question of maximizing the number of packets that arrive to destination rather than being dropped due to full buffers. We initiate a more refined analysis of the throughput competitive ratio of admission and scheduling policies in the Competitive Network Throughput model [Aiello et al. 2005], taking into account not only the network size but also the buffer size and the injection rate of the traffic. We specifically consider the problem of information gathering on the line, with limited buffer space, under adversarial traffic. We examine how the buffer size and the injection rate of the traffic affect the performance of the greedy protocol for this problem. We establish upper bounds on the competitive ratio of the greedy protocol in terms of the network size, the buffer size, and the adversary's rate, and present lower bounds which are tight up to constant factors. These results show, for example, that provisioning the network with sufficiently large buffers may substantially improve the performance of the greedy protocol in some cases, whereas for some high-rate adversaries, using larger buffers does not have any effect on the competitive ratio of the protocol.
@article{rosen2011rate, bibtex_show = {true}, abbr = {ToA}, themes = {buffer_scheduling}, title = {Rate vs. Buffer Size: Greedy Information Gathering on the Line}, author = {Rosén, A. and Scalosub, G.}, journal = {ACM Transactions on Algorithms}, volume = {7}, number = {3}, abstract = {We consider packet networks with limited buffer space at the nodes, and are interested in the question of maximizing the number of packets that arrive to destination rather than being dropped due to full buffers. We initiate a more refined analysis of the throughput competitive ratio of admission and scheduling policies in the Competitive Network Throughput model [Aiello et al. 2005], taking into account not only the network size but also the buffer size and the injection rate of the traffic. We specifically consider the problem of information gathering on the line, with limited buffer space, under adversarial traffic. We examine how the buffer size and the injection rate of the traffic affect the performance of the greedy protocol for this problem. We establish upper bounds on the competitive ratio of the greedy protocol in terms of the network size, the buffer size, and the adversary's rate, and present lower bounds which are tight up to constant factors. These results show, for example, that provisioning the network with sufficiently large buffers may substantially improve the performance of the greedy protocol in some cases, whereas for some high-rate adversaries, using larger buffers does not have any effect on the competitive ratio of the protocol.}, year = {2011} }
2010
- JDAOnline Time-Constrained Scheduling in Linear and Ring NetworksJ. Naor, A. Rosén, and G. ScalosubJournal of Discrete Algorithms, 2010
We consider the problem of scheduling a sequence of packets over a linear network, where every packet has a source and a target, as well as a release time and a deadline by which it must arrive at its target. The model we consider is bufferless, where packets are not allowed to be buffered in nodes along their paths other than at their source. This model applies to optical networks where opto-electronic conversion is costly, and packets mostly travel through bufferless hops. The offline version of this problem was previously studied in M. Adler et al. (2002). In this paper we study the online version of the problem, where we are required to schedule the packets without knowledge of future packet arrivals. We use competitive analysis to evaluate the performance of our algorithms. We present the first online algorithms for several versions of the problem. For the problem of throughput maximization, where all packets have uniform weights, we give an algorithm with a logarithmic competitive ratio, and present some lower bounds. For other weight functions, we show algorithms that achieve optimal competitive ratios.
@article{naor2010online, bibtex_show = {true}, abbr = {JDA}, themes = {buffer_scheduling}, title = {Online Time-Constrained Scheduling in Linear and Ring Networks}, author = {Naor, J. and Rosén, A. and Scalosub, G.}, journal = {Journal of Discrete Algorithms}, volume = {8}, number = {4}, pages = {346--355}, abstract = {We consider the problem of scheduling a sequence of packets over a linear network, where every packet has a source and a target, as well as a release time and a deadline by which it must arrive at its target. The model we consider is bufferless, where packets are not allowed to be buffered in nodes along their paths other than at their source. This model applies to optical networks where opto-electronic conversion is costly, and packets mostly travel through bufferless hops. The offline version of this problem was previously studied in M. Adler et al. (2002). In this paper we study the online version of the problem, where we are required to schedule the packets without knowledge of future packet arrivals. We use competitive analysis to evaluate the performance of our algorithms. We present the first online algorithms for several versions of the problem. For the problem of throughput maximization, where all packets have uniform weights, we give an algorithm with a logarithmic competitive ratio, and present some lower bounds. For other weight functions, we show algorithms that achieve optimal competitive ratios.}, year = {2010} }
2009
- ToAJitter Regulation for Multiple StreamsD. Hay, and G. ScalosubACM Transactions on Algorithms, 2009
For widely used interactive communication, it is essential that traffic is kept as smooth as possible; the smoothness of the traffic is typically captured by its delay jitter, that is, the difference between the maximal and minimal end-to-end delays. The task of minimizing the jitter is done by jitter regulators that use a limited-size buffer in order to shape the traffic. In many real-life situations regulators must handle multiple streams simultaneously and provide low jitter on each of them separately. Moreover, communication links have limited capacity, and these may pose further restrictions on the choices made by the regulator. This article investigates the problem of minimizing jitter in such an environment, using a fixed-size buffer. We show that the offline version of the problem can be solved in polynomial time, by introducing an efficient offline algorithm that finds a release schedule with optimal jitter. When regulating M streams in the online setting, we take a competitive analysis point of view and note that, in the upcapacitated case, previous results in Mansour and Patt-Shamir [2001] can be extended to an online algorithm that uses a buffer of size $2\cdot M\cdot B$ and obtains the optimal jitter possible with a buffer of size $B$ (and an offline algorithm). The question arises whether such a resource augmentation is essential. We answer this question in the affirmative, by proving a lower bound that is tight up to a factor of 2, thus showing that jitter regulation does not scale well as the number of streams increases unless the buffer is sized-up proportionally.
@article{hay2009jitter, bibtex_show = {true}, abbr = {ToA}, themes = {buffer_scheduling}, title = {Jitter Regulation for Multiple Streams}, author = {Hay, D. and Scalosub, G.}, journal = {ACM Transactions on Algorithms}, volume = {6}, number = {1}, abstract = {For widely used interactive communication, it is essential that traffic is kept as smooth as possible; the smoothness of the traffic is typically captured by its delay jitter, that is, the difference between the maximal and minimal end-to-end delays. The task of minimizing the jitter is done by jitter regulators that use a limited-size buffer in order to shape the traffic. In many real-life situations regulators must handle multiple streams simultaneously and provide low jitter on each of them separately. Moreover, communication links have limited capacity, and these may pose further restrictions on the choices made by the regulator. This article investigates the problem of minimizing jitter in such an environment, using a fixed-size buffer. We show that the offline version of the problem can be solved in polynomial time, by introducing an efficient offline algorithm that finds a release schedule with optimal jitter. When regulating M streams in the online setting, we take a competitive analysis point of view and note that, in the upcapacitated case, previous results in Mansour and Patt-Shamir [2001] can be extended to an online algorithm that uses a buffer of size $2\cdot M\cdot B$ and obtains the optimal jitter possible with a buffer of size $B$ (and an offline algorithm). The question arises whether such a resource augmentation is essential. We answer this question in the affirmative, by proving a lower bound that is tight up to a factor of 2, thus showing that jitter regulation does not scale well as the number of streams increases unless the buffer is sized-up proportionally.}, year = {2009} }
2008
- ICNPCompetitive Analysis of Buffer Policies with SLA CommitmentsB. Patt-Shamir, G. Scalosub, and Y. ShavittProceedings of ICNP 2008, the 16th IEEE International Conference on Network Protocols, 2008
We consider an abstraction of the problem of managing buffers where traffic is subject to service level agreements (SLA). In our abstraction of SLAs, some packets are marked as committed and the others are marked as excess. The service provider must on one hand deliver all committed packets, and on the other hand can get extra revenue for any excess packet delivered. We study online algorithms managing a buffer with limited space, whose task is to decide which packets should be delivered and which should be dropped. Using competitive analysis, we show how to utilize additional buffer space and link bandwidth so that the number of excess packets delivered is comparable to the best possible by any off-line algorithm, while guaranteeing that no committed packet is ever lost.
@inproceedings{pattshamir2008competitive, bibtex_show = {true}, abbr = {ICNP}, themes = {buffer_scheduling}, title = {Competitive Analysis of Buffer Policies with SLA Commitments}, author = {Patt-Shamir, B. and Scalosub, G. and Shavitt, Y.}, booktitle = {Proceedings of ICNP 2008, the 16th IEEE International Conference on Network Protocols}, pages = {197--206}, year = {2008}, pdf = {publications/pattshamir08competitive.pdf}, slides = {talks/pattshamir08competitive-slides.ppt}, abstract = {We consider an abstraction of the problem of managing buffers where traffic is subject to service level agreements (SLA). In our abstraction of SLAs, some packets are marked as committed and the others are marked as excess. The service provider must on one hand deliver all committed packets, and on the other hand can get extra revenue for any excess packet delivered. We study online algorithms managing a buffer with limited space, whose task is to decide which packets should be delivered and which should be dropped. Using competitive analysis, we show how to utilize additional buffer space and link bandwidth so that the number of excess packets delivered is comparable to the best possible by any off-line algorithm, while guaranteeing that no committed packet is ever lost.} }
▸ Edge-Cloud Resource Allocation
Service provisioning, placement, and embedding with provable guarantees in cloud and edge-cloud environments.
This thread studies resource allocation problems arising in cloud and edge-cloud systems: dynamic service provisioning across the edge-cloud continuum, VM placement with constrained communication overhead, virtual service embedding under time-varying load, and reducing service deployment cost through VNF sharing. The emphasis throughout is on algorithms with provable performance guarantees that remain practical at scale.2023
- ToNDynamic Service Provisioning in the Edge-cloud Continuum with Bounded ResourcesI. Cohen, C. Chiasserini, P. Giaccone, and G. ScalosubIEEE/ACM Transactions on Networking, 2023
We consider a hierarchical edge-cloud architecture in which services are provided to mobile users as chains of virtual network functions. Each service has specific computation requirements and target delay performance, which require placing the corresponding chain properly and allocating a suitable amount of computing resources. Furthermore, chain migration may be necessary to meet the services' target delay. We model and formalize the problem of finding a feasible chain placement and resource allocation, while minimizing the migration, bandwidth, and computation costs. We tackle this problem by partitioning it into a (i) CPU allocation problem, and a (ii) placement problem. For the CPU allocation problem, we find an optimal solution. For the placement problem, we show that even finding a feasible solution is NP-hard, and envision an algorithm that is guaranteed to find a feasible solution while leveraging a bounded amount of resource augmentation. Our algorithms are incorporated into a solution framework that aims to minimize both the cost and the required resource augmentation. The results, obtained through trace-driven, large-scale simulations, show that our framework can provide a close-to-optimal solution while running several orders of magnitude faster than an ILP solver.
@article{cohen2023dynamic, bibtex_show = {true}, abbr = {ToN}, themes = {edge_cloud}, title = {Dynamic Service Provisioning in the Edge-cloud Continuum with Bounded Resources}, author = {Cohen, I. and Chiasserini, C.-F. and Giaccone, P. and Scalosub, G.}, journal = {IEEE/ACM Transactions on Networking}, volume = {31}, number = {6}, pages = {3096--3111}, year = {2023}, abstract = {We consider a hierarchical edge-cloud architecture in which services are provided to mobile users as chains of virtual network functions. Each service has specific computation requirements and target delay performance, which require placing the corresponding chain properly and allocating a suitable amount of computing resources. Furthermore, chain migration may be necessary to meet the services' target delay. We model and formalize the problem of finding a feasible chain placement and resource allocation, while minimizing the migration, bandwidth, and computation costs. We tackle this problem by partitioning it into a (i) CPU allocation problem, and a (ii) placement problem. For the CPU allocation problem, we find an optimal solution. For the placement problem, we show that even finding a feasible solution is NP-hard, and envision an algorithm that is guaranteed to find a feasible solution while leveraging a bounded amount of resource augmentation. Our algorithms are incorporated into a solution framework that aims to minimize both the cost and the required resource augmentation. The results, obtained through trace-driven, large-scale simulations, show that our framework can provide a close-to-optimal solution while running several orders of magnitude faster than an ILP solver.}, selected = {true} } - TNSMHigh Throughput VMs Placement with Constrained Communication Overhead and Provable GuaranteesI. Cohen, G. Einziger, M. Goldstein, Y. Sa'ar, G. Scalosub, and E. WaisbardIEEE Transactions on Network and Service Management, 2023
Placement of VMs in the cloud is one of the most fundamental problems in systems research. Traditionally, placement algorithms assume that the schedulers have complete information about the currently available resources at each host. However, this assumption is in many cases unrealistic, as gathering fresh status information from each of the thousands of hosts in a large data center incurs excessive communication overhead, which results in long queueing delays. Efforts to resolve this problem by employing several parallel schedulers typically exhibit collisions when several schedulers are simultaneously trying to place VMs on the same host. Our work analyzes the performance of various placement algorithms and provides empirical evidence that using multiple randomized schedulers obtains high throughput, while significantly decreasing both the communication overhead, and the number of collisions between schedulers. We, therefore, introduce Adaptive Partial State Random (APSR) -- an efficient parallel random resource management algorithm that samples only from a small number of hosts and dynamically adjusts the degree of parallelism to provide provable guarantees on the probability of collisions between distinct schedulers. We formally analyze APSR, evaluate it on real workloads, and integrate it into the popular OpenStack cloud management platform. Our evaluation shows that APSR matches the throughput provided by other parallel schedulers, while achieving up to 13x lower decline ratio and a reduction of over 85\% in communication overheads.
@article{cohen2023highthroughput, bibtex_show = {true}, abbr = {TNSM}, themes = {edge_cloud}, title = {High Throughput VMs Placement with Constrained Communication Overhead and Provable Guarantees}, author = {Cohen, I. and Einziger, G. and Goldstein, M. and Sa'ar, Y. and Scalosub, G. and Waisbard, E.}, journal = {IEEE Transactions on Network and Service Management}, volume = {20}, number = {3}, pages = {3148--3161}, abstract = {Placement of VMs in the cloud is one of the most fundamental problems in systems research. Traditionally, placement algorithms assume that the schedulers have complete information about the currently available resources at each host. However, this assumption is in many cases unrealistic, as gathering fresh status information from each of the thousands of hosts in a large data center incurs excessive communication overhead, which results in long queueing delays. Efforts to resolve this problem by employing several parallel schedulers typically exhibit collisions when several schedulers are simultaneously trying to place VMs on the same host. Our work analyzes the performance of various placement algorithms and provides empirical evidence that using multiple randomized schedulers obtains high throughput, while significantly decreasing both the communication overhead, and the number of collisions between schedulers. We, therefore, introduce Adaptive Partial State Random (APSR) -- an efficient parallel random resource management algorithm that samples only from a small number of hosts and dynamically adjusts the degree of parallelism to provide provable guarantees on the probability of collisions between distinct schedulers. We formally analyze APSR, evaluate it on real workloads, and integrate it into the popular OpenStack cloud management platform. Our evaluation shows that APSR matches the throughput provided by other parallel schedulers, while achieving up to 13x lower decline ratio and a reduction of over 85\% in communication overheads.}, year = {2023} } - TCCVirtual Service Embedding with Time-Varying Load and Provable GuaranteesC. Chiasserini, G. Einziger, F. Malandrino, and G. ScalosubIEEE Transactions on Cloud Computing, 2023
Deploying services efficiently while satisfying their quality requirements is a major challenge in network slicing. Effective solutions place instances of the services' virtual network functions (VNFs) at different locations of the cellular infrastructure and manage such instances by scaling them as needed. In this work, we address the above problem and the very relevant aspect of sub-slice reuse among different services. Further, unlike prior art, we account for the services' finite lifetime and time-varying traffic load. We identify two major sources of inefficiency in service management: (i) the overspending of computing resources due to traffic of multiple services with different latency requirements being processed by the same virtual machine (VM), and (ii) the poor packing of traffic processing requests in the same VM, leading to opening more VMs than necessary. To cope with the above issues, we devise an algorithm, called REShare, that can dynamically adapt to the system's operational conditions and find an optimal trade-off between the aforementioned opposite requirements. We prove that REShare has low algorithmic complexity and is asymptotic 2-competitive under a non-decreasing load. Numerical results, leveraging real-world scenarios, show that our solution outperforms alternatives, swiftly adapting to time-varying conditions and reducing service cost by over 25\%.
@article{chiasserini2023virtual, bibtex_show = {true}, abbr = {TCC}, themes = {edge_cloud}, title = {Virtual Service Embedding with Time-Varying Load and Provable Guarantees}, author = {Chiasserini, C.-F. and Einziger, G. and Malandrino, F. and Scalosub, G.}, journal = {IEEE Transactions on Cloud Computing}, volume = {11}, number = {3}, pages = {2693--2710}, abstract = {Deploying services efficiently while satisfying their quality requirements is a major challenge in network slicing. Effective solutions place instances of the services' virtual network functions (VNFs) at different locations of the cellular infrastructure and manage such instances by scaling them as needed. In this work, we address the above problem and the very relevant aspect of sub-slice reuse among different services. Further, unlike prior art, we account for the services' finite lifetime and time-varying traffic load. We identify two major sources of inefficiency in service management: (i) the overspending of computing resources due to traffic of multiple services with different latency requirements being processed by the same virtual machine (VM), and (ii) the poor packing of traffic processing requests in the same VM, leading to opening more VMs than necessary. To cope with the above issues, we devise an algorithm, called REShare, that can dynamically adapt to the system's operational conditions and find an optimal trade-off between the aforementioned opposite requirements. We prove that REShare has low algorithmic complexity and is asymptotic 2-competitive under a non-decreasing load. Numerical results, leveraging real-world scenarios, show that our solution outperforms alternatives, swiftly adapting to time-varying conditions and reducing service cost by over 25\%.}, year = {2023} }
2021
- IFIP NetworkingParallel VM Deployment with Provable GuaranteesI. Cohen, G. Einziger, M. Goldstein, Y. Sa'ar, G. Scalosub, and E. WaisbardProceedings of IFIP Networking 2021, the 20th International Federation for Information Processing (IFIP) Networking Conference, 2021
Network Function Virtualization (NFV) carries the potential for on-demand deployment of network algorithms in virtual machines (VMs). In large clouds, however, VM resource allocation incurs delays that hinder the dynamic scaling of such NFV deployment. Parallel resource management is a promising direction for boosting performance, but it may significantly increase the communication overhead and the decline ratio of deployment attempts. Our work analyzes the performance of various placement algorithms and provides empirical evidence that state of the art parallel resource management dramatically increases the decline ratio of deterministic algorithms, but hardly affects randomized algorithms. We therefore introduce APSR -- an efficient parallel random resource management algorithm that requires information only from a small number of hosts and dynamically adjusts the degree of parallelism to provide provable decline ratio guarantees. We formally analyze APSR, evaluate it on real workloads, and integrate it into the popular OpenStack cloud management platform. Our evaluation shows that APSR matches the throughput provided by other parallel schedulers, while achieving up to 13x lower decline ratio and a reduction of over 85% in communication overheads.
@inproceedings{cohen2021parallel, bibtex_show = {true}, abbr = {IFIP Networking}, themes = {edge_cloud}, title = {Parallel VM Deployment with Provable Guarantees}, author = {Cohen, I. and Einziger, G. and Goldstein, M. and Sa'ar, Y. and Scalosub, G. and Waisbard, E.}, booktitle = {Proceedings of IFIP Networking 2021, the 20th International Federation for Information Processing (IFIP) Networking Conference}, year = {2021}, abstract = {Network Function Virtualization (NFV) carries the potential for on-demand deployment of network algorithms in virtual machines (VMs). In large clouds, however, VM resource allocation incurs delays that hinder the dynamic scaling of such NFV deployment. Parallel resource management is a promising direction for boosting performance, but it may significantly increase the communication overhead and the decline ratio of deployment attempts. Our work analyzes the performance of various placement algorithms and provides empirical evidence that state of the art parallel resource management dramatically increases the decline ratio of deterministic algorithms, but hardly affects randomized algorithms. We therefore introduce APSR -- an efficient parallel random resource management algorithm that requires information only from a small number of hosts and dynamically adjusts the degree of parallelism to provide provable decline ratio guarantees. We formally analyze APSR, evaluate it on real workloads, and integrate it into the popular OpenStack cloud management platform. Our evaluation shows that APSR matches the throughput provided by other parallel schedulers, while achieving up to 13x lower decline ratio and a reduction of over 85% in communication overheads.}, additional_info = {Poster appeared in IEEE INFOCOM 2020.} }
2019
- ToNReducing Service Deployment Cost Through VNF SharingF. Malandrino, C. Chiasserini, G. Einziger, and G. ScalosubIEEE/ACM Transactions on Networking, 2019
Thanks to its computational and forwarding capabilities, the mobile network infrastructure can support several third-party (``vertical'') services, each composed of a graph of virtual (network) functions (VNFs). Importantly, one or more VNFs are often common to multiple services, thus the services deployment cost could be reduced by letting the services share the same VNF instance instead of devoting a separate instance to each service. By doing that, however, it is critical that the target KPI (key performance indicators) of all services are met. To this end, we study the VNF sharing problem and make decisions on 1) when sharing VNFs among multiple services is possible, 2) how to adapt the virtual machines running the shared VNFs to the combined load of the assigned services, and 3) how to prioritize the services traffic within shared VNFs. All decisions aim to minimize the cost for the mobile operator, subject to requirements on end-to-end service performance, e.g., total delay. Notably, we show that the aforementioned priorities should be managed dynamically and vary across VNFs. We then propose the FlexShare algorithm to provide near-optimal VNF-sharing and priority assignment decisions in polynomial time. We prove that FlexShare is within a constant factor from the optimum and, using real-world VNF graphs, we show that it consistently outperforms baseline solutions.
@article{malandrino2019reducing, bibtex_show = {true}, abbr = {ToN}, themes = {edge_cloud}, title = {Reducing Service Deployment Cost Through VNF Sharing}, author = {Malandrino, F. and Chiasserini, C.-F. and Einziger, G. and Scalosub, G.}, journal = {IEEE/ACM Transactions on Networking}, volume = {27}, number = {6}, pages = {2363--2376}, abstract = {Thanks to its computational and forwarding capabilities, the mobile network infrastructure can support several third-party (``vertical'') services, each composed of a graph of virtual (network) functions (VNFs). Importantly, one or more VNFs are often common to multiple services, thus the services deployment cost could be reduced by letting the services share the same VNF instance instead of devoting a separate instance to each service. By doing that, however, it is critical that the target KPI (key performance indicators) of all services are met. To this end, we study the VNF sharing problem and make decisions on 1) when sharing VNFs among multiple services is possible, 2) how to adapt the virtual machines running the shared VNFs to the combined load of the assigned services, and 3) how to prioritize the services traffic within shared VNFs. All decisions aim to minimize the cost for the mobile operator, subject to requirements on end-to-end service performance, e.g., total delay. Notably, we show that the aforementioned priorities should be managed dynamically and vary across VNFs. We then propose the FlexShare algorithm to provide near-optimal VNF-sharing and priority assignment decisions in polynomial time. We prove that FlexShare is within a constant factor from the optimum and, using real-world VNF graphs, we show that it consistently outperforms baseline solutions.}, year = {2019} }
▸ Cloud & Microarchitectural Security
Security implications of co-location and resource sharing in virtualized and cloud environments.
More recently, I've been working on cyber security with an emphasis on microarchitectural attacks and virtualized environments, including the security implications of service placement and co-location decisions in shared cloud infrastructure.2026
- uASC$\mu$-ops, I Did it Again: A Second Look at Port Assignment on Intel CPUs (WiP)Y. Oziel, T. Laor, S. Levy, C. Maurice, Y. Oren, T. Rokicki, and G. ScalosubProceedings of the 2nd Microarchitecture Security Conference (uASC), 2026
Modern Intel CPUs use a superscalar, out-of-order execution (OoOE) pipeline that maximizes instruction throughput through instruction-level parallelism (ILP). A critical component affecting system performance is the process in which $\mu$-ops are mapped to execution ports, which lays at the core of the ability to perform OoOE. Significant effort has been undertaken in the attempt to reverse-engineer and model this process. Our work revisits this question from a security perspective, focusing on potential vulnerabilities stemming from $\mu$-op port assignment and execution. Using carefully designed code gadgets, we expose behaviors that contradict state-of-the-art models. In this Work In Progress, we ask: Which undocumented behaviors in Intel's port assignment algorithm affect instruction scheduling? By extending the micro-benchmarks used by Abel and Reineke, we identify corner cases that strongly deviates from the common case, where the CPU adopts at least three different port assignment strategies. These depend not only on the instruction and the microarchitectural generation, but, surprisingly, also on immediate operands.
@inproceedings{oziel26uops, bibtex_show = {true}, abbr = {uASC}, themes = {security}, title = {$\mu$-ops, I Did it Again: A Second Look at Port Assignment on Intel CPUs (WiP)}, author = {Oziel, Y. and Laor, T. and Levy, S. and Maurice, C. and Oren, Y. and Rokicki, T. and Scalosub, G.}, booktitle = {Proceedings of the 2nd Microarchitecture Security Conference (uASC)}, year = {2026}, abstract = {Modern Intel CPUs use a superscalar, out-of-order execution (OoOE) pipeline that maximizes instruction throughput through instruction-level parallelism (ILP). A critical component affecting system performance is the process in which $\mu$-ops are mapped to execution ports, which lays at the core of the ability to perform OoOE. Significant effort has been undertaken in the attempt to reverse-engineer and model this process. Our work revisits this question from a security perspective, focusing on potential vulnerabilities stemming from $\mu$-op port assignment and execution. Using carefully designed code gadgets, we expose behaviors that contradict state-of-the-art models. In this Work In Progress, we ask: Which undocumented behaviors in Intel's port assignment algorithm affect instruction scheduling? By extending the micro-benchmarks used by Abel and Reineke, we identify corner cases that strongly deviates from the common case, where the CPU adopts at least three different port assignment strategies. These depend not only on the instruction and the microarchitectural generation, but, surprisingly, also on immediate operands.}, selected = {true} }
